# Create RO-Crate from RiOMar dataset


## Context

### Purpose

This notebook shows how to create an RO-Crate for a dataset using the `rocrate` Python library. It is a simple example that uses no specific RO-Crate profile and follows the RO-Crate 1.1 specification. Packaging a dataset as an RO-Crate has several benefits:

- **Standardized Metadata Packaging**: RO-Crates provide a standardized way to bundle datasets with rich metadata, making it easier to understand, share, and reuse the data.
- **Enhanced FAIRness**: By including machine-readable metadata, RO-Crates improve the Findability, Accessibility, Interoperability, and Reusability (FAIR) of the dataset.
- **Improved Discoverability**: Metadata in an RO-Crate allows datasets to be easily indexed and discovered through search engines and data repositories.
- **Documentation and Provenance**: RO-Crates document essential information about the dataset, such as its source, authorship, and creation process, ensuring transparency and traceability.
- **Facilitates Integration**: The structured metadata makes it easier to integrate the dataset with other tools, workflows, or datasets, enhancing its usability.
- **Compliance with Standards**: Many funding agencies and journals now require datasets to be published with detailed metadata. RO-Crates align with these expectations and promote best practices in data management.


### Description

In this notebook, we will learn how to create a simple RO-Crate from the RiOMar data. We will then identify any missing metadata that needs to be added to the original dataset's metadata.

## Contributions

### Notebook

- Anne Fouilloux (author), Simula Research Laboratory (Norway), @annefou
- XX (reviewer)

## Bibliography and other interesting resources

- [rocrate](https://pypi.org/project/rocrate/) Python package
- [Research Object documentation](https://www.researchobject.org)

## Install and Import libraries

In [42]:
pip install rocrate rocrateValidator

Collecting rocrateValidator
  Downloading rocrateValidator-0.2.15-py3-none-any.whl.metadata (228 bytes)
Downloading rocrateValidator-0.2.15-py3-none-any.whl (11 kB)
Installing collected packages: rocrateValidator
Successfully installed rocrateValidator-0.2.15
Note: you may need to restart the kernel to use updated packages.


In [2]:
import requests
import json
from rocrate.rocrate import ROCrate
from rocrate.model.person import Person
import pandas as pd
from datetime import datetime
import geopandas
import shapely
import xarray as xr
import numpy as np
import s3fs

## Open RiOMar data to get metadata

In [3]:
url_data = "https://data-fair2adapt.ifremer.fr/riomar/small.zarr"

In [4]:
ds = xr.open_zarr(url_data)
ds

xarray dataset summary (dask-backed arrays):

| Bytes (array) | Bytes (chunk) | Shape | Chunk shape | Dask graph | Data type |
|---|---|---|---|---|---|
| 4.65 MiB | 4.65 MiB | (838, 727) | (838, 727) | 1 chunks in 2 graph layers | float64 numpy.ndarray |
| 4.65 MiB | 4.65 MiB | (838, 727) | (838, 727) | 1 chunks in 2 graph layers | float64 numpy.ndarray |
| 40 B | 8 B | (5,) | (1,) | 5 chunks in 2 graph layers | datetime64[ns] numpy.ndarray |
| 594.95 kiB | 594.95 kiB | (838, 727) | (838, 727) | 1 chunks in 2 graph layers | bool numpy.ndarray |
| 464.80 MiB | 92.96 MiB | (5, 40, 838, 727) | (1, 40, 838, 727) | 5 chunks in 2 graph layers | float32 numpy.ndarray |


## Get metadata from RiOMar

### Get the title


In [5]:
title = ds.attrs["name"]

### Get the description

- The dataset metadata should provide a better description. For now we build one from the title; it could be generated from the metadata if the metadata were more complete.

In [6]:
description = "RiOMar dataset " + title 

### Get bounding box in WKT
- Latitudes with a value of -1 are missing values, so they are replaced with NaN before computing the minimum.

In [7]:
minlat = ds.nav_lat_rho.where(ds.nav_lat_rho > -1, np.nan).min().values
maxlat = ds.nav_lat_rho.max().values
minlon = ds.nav_lon_rho.min().values
maxlon = ds.nav_lon_rho.max().values
print(minlat, maxlat, minlon, maxlon)

43.285 50.867471190931404 -8.0 1.6800000000000015


In [8]:
geometry_wkt = shapely.geometry.box(minlon, minlat, maxlon, maxlat).wkt
geometry_wkt

'POLYGON ((1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285))'

- Time range

In [9]:
ts = pd.to_datetime(str(ds.time_counter.min().values)) 
te = pd.to_datetime(str(ds.time_counter.max().values)) 
date_start = ts.strftime('%Y.%m.%d')
date_end = te.strftime('%Y.%m.%d')
date_start, date_end

('2004.01.01', '2004.01.01')
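
The dates above use dots as separators. schema.org describes `temporalCoverage` as an ISO 8601 interval, so a dash-separated form may be preferable when the value goes into the crate. The sketch below shows one way to build such an interval with the standard library; the helper name `iso_interval` is hypothetical and not part of the original notebook. `pandas.Timestamp` objects such as `ts` and `te` are `datetime` subclasses, so they could be passed in directly.

```python
from datetime import datetime


def iso_interval(start: datetime, end: datetime) -> str:
    """Return an ISO 8601 date interval such as '2004-01-01/2004-01-01'."""
    return start.strftime("%Y-%m-%d") + "/" + end.strftime("%Y-%m-%d")


# Example with plain datetime objects standing in for ts and te
print(iso_interval(datetime(2004, 1, 1), datetime(2004, 1, 1)))
```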

- Creation date: we assume that `timeStamp` contains this information (to be confirmed)

In [10]:
dateCreated = ds.attrs["timeStamp"]
dateCreated

'2024-Apr-01 10:49:18 GMT'
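
The `timeStamp` string is not in ISO 8601 form, which is what date properties such as `dateCreated` normally expect. A minimal sketch of a conversion follows, assuming the attribute always has the layout shown above and is expressed in GMT; the helper name `to_iso8601` is hypothetical. Note that `%b` matches English month abbreviations only under an English or C locale.

```python
from datetime import datetime, timezone


def to_iso8601(timestamp: str) -> str:
    """Convert a string like '2024-Apr-01 10:49:18 GMT' to ISO 8601 (UTC)."""
    parsed = datetime.strptime(timestamp, "%Y-%b-%d %H:%M:%S GMT")
    # The source string is labelled GMT, so attach the UTC time zone explicitly
    return parsed.replace(tzinfo=timezone.utc).isoformat()


print(to_iso8601("2024-Apr-01 10:49:18 GMT"))
```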

In [11]:
from datetime import date

today = date.today().strftime('%Y.%m.%d')
print("Today's date:", today)

sdDatePublished =  today # could be the date corresponding to the creation of the DOI (publishing)
dateModified =  today # could be the date of creation of the DGGS regridded data e.g. it needs to be added to Zarr when regridding

Today's date: 2025.01.19


### Get the size of the dataset
- We usually can get this information from the metadata (needs to be added)

In [12]:
contentSize = 0 # We need to get the total size in bytes
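
Until the size is recorded in the metadata, a rough estimate of the uncompressed size can be derived from the array shapes and data types reported in the dataset summary. The sketch below uses only the standard library; `uncompressed_nbytes` is a hypothetical helper, and the result is the in-memory size, not the size of the compressed Zarr store on the server.

```python
from math import prod


def uncompressed_nbytes(shape, itemsize):
    """Number of bytes needed to hold an array of this shape in memory."""
    return prod(shape) * itemsize


# The float32 variable of shape (5, 40, 838, 727): 4 bytes per value
size_4d = uncompressed_nbytes((5, 40, 838, 727), 4)
print(size_4d, "bytes, about", round(size_4d / 2**20, 2), "MiB")
```

xarray exposes a similar in-memory estimate for a whole dataset through `ds.nbytes`.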

### Get the persistent identifier
- The dataset should have a persistent identifier, e.g. a DOI (it currently does not have one)


In [13]:
doi_data = "NONE" # it is a problem
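
Because the placeholder `"NONE"` is later concatenated into a DOI URL and a citation string, it can help to check whether a value looks like a DOI before using it. The following is a sketch under the assumption that a simple pattern check is enough; the helper `doi_url` and the regular expression are illustrative and do not verify that the DOI actually resolves.

```python
import re

# Loose pattern for DOIs of the form 10.<registrant>/<suffix>
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_url(identifier):
    """Return the https://doi.org/ URL for a DOI-like string, else None."""
    if identifier and DOI_PATTERN.match(identifier):
        return "https://doi.org/" + identifier
    return None


print(doi_url("10.5281/zenodo.13898339"))  # DOI-like string
print(doi_url("NONE"))                     # placeholder, not a DOI
```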

### StudySubject and keywords

- The study subject is given as a URL from the INSPIRE topic category code list; keywords are free text.

In [14]:
studySubject_urls = [ "http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/environment"]
keywords = ["riomar", "croco"]

### Version of the dataset

In [15]:
version_data = "1.0"

### Prepare information for the provenance

In [16]:
prov = {
      "@id": "https://doi.org/10.5281/zenodo.13898339",
      "@type": "SoftwareApplication",
      "url": "https://www.croco-ocean.org",
      "name": "CROCO, Coastal and Regional Ocean COmmunity",
      "version": "CROCO GAMA model v2.0.1 https://doi.org/10.5281/zenodo.13898339"
}

## Create a new RO-Crate

In [17]:
crate = ROCrate()

## Add the license for the RO-Crate

- The license of the Research Object (RO-Crate) may differ from the licenses of the data bundled in the RO-Crate.
- Our RO-Crate is open and distributed under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
- The value of the `license` property needs to be a URL (here `https://creativecommons.org/licenses/by/4.0/`).

In [18]:
RO_license_id = "CC-BY-4.0"
RO_license_url = "https://creativecommons.org/licenses/by/4.0/"
RO_license_title = "Creative Commons Attribution 4.0"

### Add the selected license to the RO-Crate

In [19]:
crate.update_jsonld(
{
    "@id": "./",
    "license": { "@id":  RO_license_url},
})
license = {
                "@id": RO_license_url,
                "@type": "CreativeWork",
                "name": RO_license_id,
                "description": RO_license_title,
                }
crate.add_jsonld(license)

<https://creativecommons.org/licenses/by/4.0/ CreativeWork>

## Add creators and their Organizations

- Add here the list of creators of the RO-Crate.
- Go to `https://ror.org` and search for the organisation you would like to add. In this notebook, we create this information "manually", but it could be better streamlined in the future (for instance using [Rohub](https://rohub.org)).
- If there are several authors, add each of them to the RO-Crate following the same approach.

### Add Persons and organisations

In [20]:
list_authors = []

In [21]:
organisation_1 = {
    "name": "Simula Research Laboratory",
    "id": "https://ror.org/00vn06n10",
    "url" : "https://www.simula.no"
}
creator_1 = {
    "id": "https://orcid.org/0000-0002-1784-2920", # The id is the ORCID of the author
    "email": "annef@simula.no",
    "givenName": "Anne", 
    "familyName": "Fouilloux", 
    "affiliation": {"@id": organisation_1["id"]}
    
}
creator_1

{'id': 'https://orcid.org/0000-0002-1784-2920',
 'email': 'annef@simula.no',
 'givenName': 'Anne',
 'familyName': 'Fouilloux',
 'affiliation': {'@id': 'https://ror.org/00vn06n10'}}

In [22]:
organisation_2 = {
    "name": "Ifremer",
    "id": "https://ror.org/044jxhp58",
    "url" : "https://www.ifremer.fr"
}
creator_2 = {
    "id": "https://orcid.org/0000-0002-1500-0156", # The id is the ORCID of the author
    "email": "tina.odaka@ifremer.fr",
    "givenName": "Tina Erica", 
    "familyName": "Odaka", 
    "affiliation": {"@id": organisation_2["id"]}
    
}
creator_2

{'id': 'https://orcid.org/0000-0002-1500-0156',
 'email': 'tina.odaka@ifremer.fr',
 'givenName': 'Tina Erica',
 'familyName': 'Odaka',
 'affiliation': {'@id': 'https://ror.org/044jxhp58'}}

In [23]:
list_orcids = [ creator_1["id"], creator_2["id"]]
list_orcids

['https://orcid.org/0000-0002-1784-2920',
 'https://orcid.org/0000-0002-1500-0156']

### Adding all the authors

In [24]:
list_authors.append(creator_1['givenName'] + " " +  creator_1['familyName'])
list_authors.append(creator_2['givenName'] + " " +  creator_2['familyName'])
list_authors

['Anne Fouilloux', 'Tina Erica Odaka']

Add the two creators as `Person` entities in the RO-Crate

In [25]:
crate.add(Person(crate, creator_1.pop("id"), properties=creator_1))
crate.add(Person(crate, creator_2.pop("id"), properties=creator_2))

<https://orcid.org/0000-0002-1500-0156 Person>
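
Each Person's `affiliation` points to a ROR identifier, but the cells above do not add the organisations themselves as entities. A minimal sketch of how the organisation dictionaries could be turned into JSON-LD entities is shown below; the helper name `organisation_to_jsonld` is hypothetical, and the resulting dictionary could then be passed to `crate.add_jsonld`, the call already used for the license.

```python
def organisation_to_jsonld(org):
    """Build a JSON-LD Organization entity from a dict with id, name and url."""
    return {
        "@id": org["id"],
        "@type": "Organization",
        "name": org["name"],
        "url": org["url"],
    }


# Same values as organisation_1 in this notebook
organisation_1 = {
    "name": "Simula Research Laboratory",
    "id": "https://ror.org/00vn06n10",
    "url": "https://www.simula.no",
}
entity = organisation_to_jsonld(organisation_1)
print(entity)
# In the notebook this could be followed by: crate.add_jsonld(entity)
```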

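Add the list of authors to the RO-Crate

In [26]:
crate.update_jsonld({
    "@id": "./",
    # Reference each Person entity by its @id rather than by a plain string
    "author": [{"@id": orcid} for orcid in list_orcids],
})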

<./ Dataset>

### Add information about data bundled in the RO-Crate

#### Prepare Temporal coverage if available

In [27]:
temporal_coverage = date_start + "/" + date_end
temporal_coverage

'2004.01.01/2004.01.01'

#### Prepare Spatial coverage if available

In [28]:
def get_geoshape(geometry):
    # We assume wkt geometry
    geo = shapely.wkt.loads(geometry)
    if hasattr(geo, 'geoms'):
        # take the first one
        geo = geo.geoms[0]
    geo = geo.wkt.replace("POLYGON", "").replace("(","").replace(")","").strip()   
    geolocation = { "@type": "GeoShape", "@id": geo, "polygon": geo}
    return geolocation


geolocation = get_geoshape(geometry_wkt)
geolocation

{'@type': 'GeoShape',
 '@id': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285',
 'polygon': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285'}
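
The function above keeps the WKT coordinate order (longitude, then latitude, with commas between points) and reuses the coordinate string as the `@id`. schema.org describes a GeoShape `polygon` as a series of space-delimited points, and by common convention (assumed here, as in the RO-Crate examples) latitude comes before longitude. The sketch below shows one possible conversion using only the standard library; the helper name and the local identifier `#riomar-bbox` are hypothetical.

```python
import re


def wkt_polygon_to_schema_org(wkt):
    """Convert the outer ring of a WKT POLYGON ('lon lat, ...') to 'lat lon lat lon ...'."""
    match = re.search(r"\(\(([^()]*)\)", wkt)
    if match is None:
        raise ValueError("expected a WKT string of the form POLYGON ((...))")
    points = []
    for pair in match.group(1).split(","):
        lon, lat = pair.split()
        points.append(f"{lat} {lon}")  # swap to latitude-first order
    return " ".join(points)


example_wkt = "POLYGON ((1.68 43.285, 1.68 50.867, -8 50.867, -8 43.285, 1.68 43.285))"
geoshape = {
    "@id": "#riomar-bbox",  # hypothetical local identifier
    "@type": "GeoShape",
    "polygon": wkt_polygon_to_schema_org(example_wkt),
}
print(geoshape)
```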

### Go through each dataset and add it to the RO-Crate
- In this example we add only one dataset

In [29]:
properties =  {
    "modified_date": dateModified, 
    "name": url_data, 
    "location": geolocation,
    "temporalCoverage": temporal_coverage, 
    "sdDatePublished": sdDatePublished, 
    "dateCreated": dateCreated, 
    "dateModified": dateModified, # could be the date of creation of the DGGS regridded data
    # "contentSize": contentSize,  # TBC
    "encodingFormat": ' text/html; charset=us-ascii '
}

print("properties = ", properties)

resource = crate.add_file(url_data, fetch_remote = False, properties=properties)

properties =  {'modified_date': '2025.01.19', 'name': 'https://data-fair2adapt.ifremer.fr/riomar/small.zarr', 'location': {'@type': 'GeoShape', '@id': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285', 'polygon': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285'}, 'temporalCoverage': '2004.01.01/2004.01.01', 'sdDatePublished': '2025.01.19', 'dateCreated': '2024-Apr-01 10:49:18 GMT', 'dateModified': '2025.01.19', 'encodingFormat': ' text/html; charset=us-ascii '}


## Add metadata to the RO-Crate

### Add the title and description

In [30]:
crate.update_jsonld({
    "@id": "./",
    "description": description,
    "title": title,
    "name": title,
})

<./ Dataset>

### Add the publisher

In [31]:
publisher_name = "Sigma2 AS"
publisher_url = "https://www.wikidata.org/wiki/Q12008197"
publisher = {
                "@id": publisher_url,
                "@type": "Organization",
                "name": publisher_name,
                "url": publisher_url
                }
crate.add_jsonld(publisher)
crate.update_jsonld(
{
    "@id": "./",
    "publisher": { "@id": publisher_url },
})

<./ Dataset>

### Add the creator of the RO-Crate

In [32]:
crate.update_jsonld(
{
    "@id": "ro-crate-metadata.json",
    "creator": { "@id": publisher_url },
})

<ro-crate-metadata.json CreativeWork>

### Add Publication date

In [33]:
date_published =  datetime.strptime(sdDatePublished, "%Y.%m.%d")

crate.update_jsonld({
    "@id": "./",
    "datePublished":  date_published.strftime("%Y-%m-%d") ,
})

<./ Dataset>

### Add citation

In [34]:
doi = "https://doi.org/" + doi_data
cite_as = " and ".join(list_authors) + ", " + title + ", " + publisher_name + ", " + date_published.strftime("%Y") + ". " +  doi_data + "."

crate.update_jsonld({
    "@id": "./",
    "identifier": doi_data,
    "url": doi_data,
    "cite-as":  cite_as ,
})


<./ Dataset>

### Add studySubject, keywords, etc.

The studySubject comes from the code list at `http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/`.
Go to that URL and select the study subject that is most relevant to your data.

In [35]:
# One JSON-LD reference per study subject URL
study_subjects = [{"@id": subject_url} for subject_url in studySubject_urls]
study_subjects

[{'@id': 'http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/environment'}]

In [36]:
keywords = ", ".join(keywords)
keywords

'riomar, croco'

In [37]:
crate.update_jsonld({
    "@id": "./",
    "about": study_subjects,
    "keywords":  keywords,
})

<./ Dataset>

### Add version

In [38]:
crate.update_jsonld({
    "@id": "./",
    "version": version_data,
})

<./ Dataset>

### Add Language

In [39]:
#crate.update_jsonld({
#    "@id": ,
#    "@type": "Language",
#})

## Write to disk

In [40]:
crate.write("ro-crate")
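
Once the crate is written, the metadata file can be inspected with the standard library. The sketch below lists every `{"@id": ...}` reference that has no matching entity in the `@graph`, which is one way to spot entities that still need to be added (for example an affiliation that points to an organisation not described in the crate). The function names are hypothetical, and this is not a substitute for a full validator.

```python
import json


def iter_refs(value):
    """Yield every {"@id": ...} reference nested inside a property value."""
    if isinstance(value, dict):
        if set(value) == {"@id"}:
            yield value["@id"]
        else:
            for item in value.values():
                yield from iter_refs(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_refs(item)


def dangling_references(metadata):
    """Return the sorted @id values that are referenced but not defined in @graph."""
    graph = metadata["@graph"]
    defined = {entity["@id"] for entity in graph}
    referenced = set()
    for entity in graph:
        for value in entity.values():
            referenced.update(iter_refs(value))
    return sorted(referenced - defined)


# Small in-memory example; for the crate written above one could instead load
# "ro-crate/ro-crate-metadata.json" with json.load and pass the result in.
sample = json.loads("""
{"@graph": [
  {"@id": "./", "@type": "Dataset",
   "author": [{"@id": "https://orcid.org/0000-0002-1784-2920"}]},
  {"@id": "https://orcid.org/0000-0002-1784-2920", "@type": "Person",
   "affiliation": {"@id": "https://ror.org/00vn06n10"}}
]}
""")
print(dangling_references(sample))
```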

## Validate the RO-Crate

In [43]:
from rocrateValidator import validate

In [44]:
v = validate.validate("ro-crate")
v.validator()

This is an INVALID RO-Crate
{
    "File existence": [
        true
    ],
    "File size": [
        true
    ],
    "Metadata file existence": [
        true
    ],
    "Json check": [
        true
    ],
    "Json-ld check": [
        true
    ],
    "File descriptor check": [
        true
    ],
    "Direct property check": [
        true
    ],
    "Referencing check": [
        true
    ],
    "Encoding check": [
        true
    ],
    "Web-based data entity check": [
        false,
        "Semantic Error: Invalid ID at https://data-fair2adapt.ifremer.fr/riomar/small.zarr. It should be a downloadable url"
    ],
    "Person entity check": [
        true
    ],
    "Organization entity check": [
        true
    ],
    "Contact information check": [
        true
    ],
    "Citation property check": [
        true
    ],
    "Publisher property check": [
        true
    ],
    "Funder property check": [
        true
    ],
    "Licensing property check": [
        false,
       