# Organisation/Funder/Repository Data Management Plans statistics

Data management plans (DMPs) are documents that accompany research proposals and project outputs. They are written as textual narratives and describe the data and tools employed in scientific investigations. They are sometimes seen as an administrative exercise rather than an integral part of research practice. Machine-actionable DMPs (maDMPs) take the DMP concept further by using persistent identifiers (PIDs) and PID services to connect all resources associated with a DMP.


This notebook displays DMP statistics for an organisation, a funder, and a data repository. To demonstrate this we use the **California Digital Library** as the organisation (https://ror.org/03yrm5c26), the **European Commission** as the funder (https://doi.org/10.13039/501100000780), and Zenodo (`cern.zenodo`) as the repository. Each summary table has one row per DMP, giving its title, its PID (DOI), the number of linked datasets and publications, the number of people involved, and the associated organisations and funder.


Displaying the DMP statistics takes three steps. After an initial setup, we fetch the data we need from the DataCite GraphQL API. We then transform the response into a data structure suitable for computation. Finally, we render the transformed data as a table.




In [1]:
import ipywidgets as widgets
f = widgets.Dropdown(
    options=['https://ror.org/00k4n6c32', 'https://ror.org/03yrm5c26'],
    value='https://ror.org/03yrm5c26',
    description='Choose Organisation:',
    disabled=False,
)
f

Dropdown(description='Choose Organisation:', index=1, options=('https://ror.org/00k4n6c32', 'https://ror.org/0…

In [2]:
%%capture
# Install required Python packages
!pip install dfply

In [3]:
import json
import pandas as pd
import numpy as np
from dfply import *


In [4]:
# Prepare the GraphQL client
import requests
from IPython.display import display, Markdown
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport

_transport = RequestsHTTPTransport(
    url='https://api.datacite.org/graphql',
    use_json=True,
)

client = Client(
    transport=_transport,
    fetch_schema_from_transport=True,
)

## Fetching Data

We obtain all the data from the DataCite GraphQL API.
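As a minimal sketch of what such a request looks like on the wire, the snippet below builds a GraphQL-over-HTTP payload with the standard library only. It assumes the endpoint accepts a JSON POST body with `query` and `variables` members, which is the usual GraphQL convention; `build_payload`, `run_query`, and `COUNT_QUERY` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import json
import urllib.request

# Same endpoint as the gql client configured earlier in this notebook.
DATACITE_GRAPHQL_URL = "https://api.datacite.org/graphql"


def build_payload(query, variables):
    """Return the JSON body of a GraphQL POST request as bytes."""
    return json.dumps({"query": query, "variables": variables}).encode("utf-8")


def run_query(query, variables, url=DATACITE_GRAPHQL_URL):
    """POST the query and return the "data" member (needs network access)."""
    request = urllib.request.Request(
        url,
        data=build_payload(query, variables),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["data"]


# Hypothetical minimal query: only the organisation name and its DMP count.
COUNT_QUERY = """
query getDmpCount($rorId: ID!) {
  organization(id: $rorId) {
    name
    dataManagementPlans(first: 1) {
      totalCount
    }
  }
}
"""

# Building the payload does not touch the network; run_query would.
payload = build_payload(COUNT_QUERY, {"rorId": "https://ror.org/03yrm5c26"})
```

Calling `run_query(COUNT_QUERY, {"rorId": "https://ror.org/03yrm5c26"})` would send the request. The notebook itself uses the `gql` client instead, which also handles schema fetching.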


In [5]:
# Build the GraphQL queries that retrieve up to 50 DMPs each for the chosen organisation, funder, and repository.
query_params = {
    "rorId" : f.value,
    "funderId" : "https://doi.org/10.13039/501100000780",
    "repositoryId" : "cern.zenodo"
}

organizationQuery = gql("""query getOutputs($rorId: ID!)
{
  organization(id: $rorId) {
    name
    dataManagementPlans(first: 50) {
      totalCount
      nodes {
        id
        title: titles(first: 1) {
          title
        }
        datasets: citations(query:"types.resourceTypeGeneral:Dataset") {
          totalCount
        }
        publications: citations(query:"types.resourceTypeGeneral:Text") {
          totalCount
        }
        hostingInstitution: contributors(contributorType: "HostingInstitution") {
          id
          title: name
        }
        producer: contributors(contributorType: "Producer") {
          id
          title: name
        }
        funders: fundingReferences {
          id: funderIdentifier
          funderIdentifierType
          title: funderName
        }
        people: creators {
          id
          name
        }
        contributors {
          id
          name
        }
      }
    }
  }
}
""")

funderQuery = gql("""query getOutputs($funderId: ID!)
{
  funder(id: $funderId) {
    name
    dataManagementPlans(first: 50) {
      totalCount
      nodes {
        id
        title: titles(first: 1) {
          title
        }
        datasets: citations(query:"types.resourceTypeGeneral:Dataset") {
          totalCount
        }
        publications: citations(query:"types.resourceTypeGeneral:Text") {
          totalCount
        }
        hostingInstitution: contributors(contributorType: "HostingInstitution") {
          id
          title: name
        }
        producer: contributors(contributorType: "Producer") {
          id
          title: name
        }
        funders: fundingReferences {
          id: funderIdentifier
          funderIdentifierType
          title: funderName
        }
        people: creators {
          id
          name
        }
        contributors {
          id
          name
        }
      }
    }
  }
}
""")

repositoryQuery = gql("""query getOutputs($repositoryId: ID!)
{
  repository(id: $repositoryId) {
    name
    dataManagementPlans(first: 50) {
      totalCount
      nodes {
        id
        title: titles(first: 1) {
          title
        }
        datasets: citations(query:"types.resourceTypeGeneral:Dataset") {
          totalCount
        }
        publications: citations(query:"types.resourceTypeGeneral:Text") {
          totalCount
        }
        hostingInstitution: contributors(contributorType: "HostingInstitution") {
          id
          title: name
        }
        producer: contributors(contributorType: "Producer") {
          id
          title: name
        }
        funders: fundingReferences {
          id: funderIdentifier
          funderIdentifierType
          title: funderName
        }
        people: creators {
          id
          name
        }
        contributors {
          id
          name
        }
      }
    }
  }
}
""")
 

In [6]:
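def get_data(entity_type):
    # Pass the variables as a dict holding only the variable each query declares
    if entity_type == "organization":
        variables = {"rorId": query_params["rorId"]}
        return client.execute(organizationQuery, variable_values=variables)["organization"]
    elif entity_type == "funder":
        variables = {"funderId": query_params["funderId"]}
        return client.execute(funderQuery, variable_values=variables)["funder"]
    else:
        variables = {"repositoryId": query_params["repositoryId"]}
        return client.execute(repositoryQuery, variable_values=variables)["repository"]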


## Data Transformation

Simple transformations convert the GraphQL response into a flat table. The final table contains only the columns used in the DMP statistics.
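To illustrate the idea independently of `dfply`, here is a rough sketch in plain pandas. The two nodes are made-up values in the shape the queries return, not real API output, and `first_title` is a hypothetical helper that plays the role of a title extractor.

```python
import pandas as pd

# Made-up nodes in the shape returned by the queries (not real API output).
sample_nodes = [
    {
        "id": "https://doi.org/10.0000/example-1",
        "title": [{"title": "Example DMP"}],
        "datasets": {"totalCount": 2},
        "publications": {"totalCount": 1},
        "funders": [{"id": None, "funderIdentifierType": None, "title": "Example Funder"}],
        "people": [{"id": None, "name": "Doe, Jane"}],
        "contributors": [{"id": None, "name": "Example Lab"}],
    },
    {
        "id": "https://doi.org/10.0000/example-2",
        "title": [],
        "datasets": {"totalCount": 0},
        "publications": {"totalCount": 0},
        "funders": [],
        "people": [],
        "contributors": [],
    },
]


def first_title(items):
    """Return the 'title' of the first item, or the string "None" if empty."""
    return items[0]["title"] if items else "None"


nodes = pd.DataFrame(sample_nodes)

# Flatten each nested field into one scalar column per DMP.
flat = pd.DataFrame({
    "DMP": nodes["title"].apply(first_title),
    "Funder": nodes["funders"].apply(first_title),
    "NumDatasets": nodes["datasets"].apply(lambda d: d["totalCount"]),
    "NumPublications": nodes["publications"].apply(lambda d: d["totalCount"]),
    # Creators and contributors are counted together.
    "NumPeople": nodes["people"].apply(len) + nodes["contributors"].apply(len),
    "doi": nodes["id"],
})

print(flat)
```

Each nested list or object collapses to a single value, so the result can be sorted, filtered, or styled like any other dataframe.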

In [7]:
def get_series_size(series_element):
    return len(series_element)

In [8]:
def get_total(series_element):
    if len(series_element) == 0:
        return 0
    return series_element['totalCount']

In [9]:
def dmp_header(row):
    s = 'DMP: '+ row.dmp + '\r Funder: '+row.funders+'\r Producer: '+row.producer+'\r Host: '+row.hostingInstitution
    return s
     

In [10]:
def get_dataset_nodes(series_element):
    return series_element['nodes']

In [11]:
def get_title(series_element):
    if len(series_element) == 0:
        return "None"
    return series_element[0]['title']

In [12]:
def transform_dmps(dataframe):
    """Add the columns needed for the DMP statistics table.

    Parameters:
    dataframe (dataframe): A dataframe with one row per DMP node

    Returns:
    dataframe: The same dataframe with the new columns added,
    or an empty dataframe if the input is None

    """
    if (dataframe) is None:
        return pd.DataFrame() 
    else: 
        return (dataframe >>
        mutate(
            DMP = X.title.apply(get_title),
            doi = X.id,
            NumDatasets = X.datasets.apply(get_total),
            NumPublications = X.publications.apply(get_total),
            Host = X.hostingInstitution.apply(get_title),
            Producer = X.producer.apply(get_title),
            Funder = X.funders.apply(get_title),
            NumPeople = (X.people + X.contributors).apply(get_series_size)
        ) 
        # >> 
        # mutate(
        #     header = dmp_header(X),
        # ) 
        # >>
        # filter_by(
        #     X.hostingInstitution > 0
        # )
        )
  

In [13]:
def processTable(entity_type):
    data = get_data(entity_type)
    nodes = data["dataManagementPlans"]["nodes"]
    if len(nodes) == 0:
        return None
    table = pd.DataFrame(nodes, columns=nodes[0].keys())
    columns = ['DMP', 'Funder', 'Producer', 'Host', 'NumDatasets', 'NumPublications', 'NumPeople', 'doi']
    return transform_dmps(table)[columns].style.set_caption(data['name'])

## DMP Statistics Visualisation


The following three tables show the DMP statistics for three different entities. Each table includes the DMP title, its funder, producer, and host, and the number of datasets, publications, and people linked to the DMP. The first table covers DMPs hosted by the California Digital Library, the second covers DMPs funded by the European Commission, and the last covers DMPs stored in the Zenodo repository.
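Once a table is available as a dataframe, the per-DMP counts can be rolled up further. The sketch below is one possible way to do that, using three rows copied from the organisation table output in this notebook; `rows`, `totals`, and `linked` are names introduced here for illustration.

```python
import pandas as pd

# Three rows copied from the organisation table output in this notebook.
rows = pd.DataFrame(
    [
        ("https://doi.org/10.48321/d1h59r", 1, 3, 5),
        ("https://doi.org/10.48321/d13w2m", 0, 0, 3),
        ("https://doi.org/10.48321/d1g59f", 0, 7, 3),
    ],
    columns=["doi", "NumDatasets", "NumPublications", "NumPeople"],
)

# Column-wise totals across the selected DMPs.
# NumPeople is a sum of per-DMP counts, so a person on two DMPs counts twice.
totals = rows[["NumDatasets", "NumPublications", "NumPeople"]].sum()

# DMPs that have at least one linked dataset or publication.
linked = rows[(rows["NumDatasets"] + rows["NumPublications"]) > 0]

print(totals)
print(linked["doi"].tolist())
```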

In [17]:
processTable("organization")

| | DMP | Funder | Producer | Host | NumDatasets | NumPublications | NumPeople | doi |
|---|---|---|---|---|---|---|---|---|
| 0 | DMPRoadmap: Making Data Management Plans Actionable | National Science Foundation (NSF) | University Of California System | California Digital Library | 0 | 0 | 4 | https://doi.org/10.48321/d1mw28 |
| 1 | LTREB: Drivers of temperate forest carbon storage from canopy closure through successional time | National Science Foundation (NSF) | University Of Michigan | California Digital Library | 1 | 3 | 5 | https://doi.org/10.48321/d1h59r |
| 2 | Late Season Productivity, Carbon, and Nutrient Dynamics in a Changing Arctic | National Science Foundation (NSF) | Oregon State University | California Digital Library | 0 | 0 | 5 | https://doi.org/10.48321/d17p4j |
| 3 | REU Site: A Multidisciplinary Research Experience in Engineered Bioactive Interfaces and Devices | National Science Foundation (NSF) | University Of Kentucky | California Digital Library | 0 | 0 | 4 | https://doi.org/10.48321/d1cc7t |
| 4 | Brown carbon characterization | National Science Foundation (NSF) | College, Harvey Mudd | California Digital Library | 0 | 0 | 3 | https://doi.org/10.48321/d13w2m |
| 5 | A Political Ecology of Value: A Cohort-Based Ethnography of the Environmental Turn in Nicaraguan Urban Social Policy | National Science Foundation (NSF) | Western Washington University | California Digital Library | 0 | 0 | 3 | https://doi.org/10.48321/d10593 |
| 6 | Finding Levers for Privacy and Security by Design in Mobile Development | National Science Foundation (NSF) | University Of Maryland, College Park | California Digital Library | 0 | 0 | 4 | https://doi.org/10.48321/d1vc75 |
| 7 | Use of telemetry and the Acoustic Wave Glider to study southern flounder migrations | National Science Foundation (NSF) | East Carolina University | California Digital Library | 0 | 0 | 6 | https://doi.org/10.48321/d1kw2z |
| 8 | The Virgin Islands Partnership to Increase Participation and Engagement through Linked, Informal, Nurturing Experiences in STEM (V.I. PIPELINES) | National Science Foundation (NSF) | University Of The Virgin Islands | California Digital Library | 0 | 0 | 7 | https://doi.org/10.48321/d1qp4w |
| 9 | DMP for The Role of Temperature in Regulating Herbivory and Algal Biomass in Upwelling Systems | National Science Foundation (NSF) | University Of North Carolina, Chapel Hill | California Digital Library | 0 | 7 | 3 | https://doi.org/10.48321/d1g59f |


In [18]:
processTable("funder")

| | DMP | Funder | Producer | Host | NumDatasets | NumPublications | NumPeople | doi |
|---|---|---|---|---|---|---|---|---|
| 0 | EURHISFIRM D1.2: Data Management Plan (first version) | European Commission | | | 0 | 0 | 3 | https://doi.org/10.5281/zenodo.3245354 |
| 1 | EURHISFIRM D1.2: Data Management Plan (first version) | European Commission | | | 0 | 0 | 3 | https://doi.org/10.5281/zenodo.3245353 |
| 2 | EURHISFIRM D1.7: Second Data Management Plan | European Commission | | | 0 | 0 | 5 | https://doi.org/10.5281/zenodo.3246339 |
| 3 | EURHISFIRM D1.7: Second Data Management Plan | European Commission | | | 0 | 0 | 5 | https://doi.org/10.5281/zenodo.3246338 |
| 4 | European Collaboration for Healthcare Optimisation (ECHO) Data Model Specification | European Commission | | | 0 | 0 | 8 | https://doi.org/10.5281/zenodo.3253683 |
| 5 | European Collaboration for Healthcare Optimisation (ECHO) Data Model Specification | European Commission | | | 0 | 0 | 8 | https://doi.org/10.5281/zenodo.3253684 |
| 6 | REEEM-D6.6_Data Management Plan (DMP) - Collection, processing and dissemination of data | European Commission | | | 0 | 0 | 1 | https://doi.org/10.5281/zenodo.3368558 |
| 7 | REEEM-D6.6_Data Management Plan (DMP) - Collection, processing and dissemination of data | European Commission | | | 0 | 0 | 1 | https://doi.org/10.5281/zenodo.3368557 |
| 8 | D6.5 Data Management Plan | European Commission | | | 0 | 0 | 1 | https://doi.org/10.5281/zenodo.3372460 |
| 9 | D6.5 Data Management Plan | European Commission | | | 0 | 0 | 1 | https://doi.org/10.5281/zenodo.3372459 |


In [19]:
processTable("repository")

| | DMP | Funder | Producer | Host | NumDatasets | NumPublications | NumPeople | doi |
|---|---|---|---|---|---|---|---|---|
| 0 | Periódicos técnicos | | | | 0 | 0 | 1 | https://doi.org/10.5281/zenodo.2655759 |
| 1 | Periódicos técnicos | | | | 0 | 0 | 1 | https://doi.org/10.5281/zenodo.2655758 |
| 2 | Fractional-order functions for solving fractional-order variational problems with boundary conditions | | | | 0 | 0 | 2 | https://doi.org/10.5281/zenodo.2741388 |
| 3 | Fractional-order functions for solving fractional-order variational problems with boundary conditions | | | | 0 | 0 | 2 | https://doi.org/10.5281/zenodo.2741387 |
| 4 | EURHISFIRM D1.2: Data Management Plan (first version) | European Commission | | | 0 | 0 | 3 | https://doi.org/10.5281/zenodo.3245354 |
| 5 | EURHISFIRM D1.2: Data Management Plan (first version) | European Commission | | | 0 | 0 | 3 | https://doi.org/10.5281/zenodo.3245353 |
| 6 | EURHISFIRM D1.7: Second Data Management Plan | European Commission | | | 0 | 0 | 5 | https://doi.org/10.5281/zenodo.3246339 |
| 7 | EURHISFIRM D1.7: Second Data Management Plan | European Commission | | | 0 | 0 | 5 | https://doi.org/10.5281/zenodo.3246338 |
| 8 | Example ezDMP output | | | | 0 | 0 | 1 | https://doi.org/10.5281/zenodo.3247755 |
| 9 | Example ezDMP output | | | | 0 | 0 | 1 | https://doi.org/10.5281/zenodo.3247756 |
