<a href="https://colab.research.google.com/github/ebi-ait/ingest-programmatic-submissions/blob/main/notebooks/create_project/programmatic_submissions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create a project

This notebook shows how to generate a submission with the Python tools available in the `hca-ingest` library. Amongst other utilities for interacting with the Ingest service, this library contains a wrapper for Ingest's API that lets you easily create, update and delete metadata entities.

This section focuses on `Projects`.

## Download example project

To have the files necessary for this guide, we're going to download the template file for the project metadata.

In [1]:
!wget https://raw.githubusercontent.com/ebi-ait/ingest-programmatic-submissions/main/_data/submission_example/project/example_project.json

--2022-11-09 17:32:55--  https://raw.githubusercontent.com/ebi-ait/ingest-programmatic-submissions/main/_data/submission_example/project/example_project.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1232 (1.2K) [text/plain]
Saving to: ‘example_project.json’


2022-11-09 17:32:55 (34.3 MB/s) - ‘example_project.json’ saved [1232/1232]



## Set up libraries and dependencies

### Install external libraries

In [2]:
!pip install hca-ingest

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting hca-ingest
  Downloading hca-ingest-2.6.0.tar.gz (57 kB)
Collecting jsonref
  Downloading jsonref-1.0.1-py3-none-any.whl (9.5 kB)
Collecting polling
  Downloading polling-0.3.2.tar.gz (5.2 kB)
Collecting xlsxwriter
  Downloading XlsxWriter-3.0.3-py3-none-any.whl (149 kB)
Collecting mergedeep
  Downloading mergedeep-1.3.4-py3-none-any.whl (6.4 kB)
Collecting cryptography
  Downloading cryptography-38.0.3-cp36-abi3-manylinux_2_24_x86_64.whl (4.1 MB)
Collecting requests-cache
  Downloading requests_cache-0.9.7-py3-none-any.whl (48 kB)
Collecting urllib3>=1.25.5
  Downloading urllib3-1.26.12-py2.py3-none-any.whl (140 kB)

### Load libraries

In [3]:
import requests as rq
import json
from hca_ingest.api.ingestapi import IngestApi

### Get a token

In order to get a token, you need to log in to the ingest UI: https://staging.contribute.data.humancellatlas.org/. For the purpose of this notebook, we will be using staging. However, that can be changed to prod (by deleting the first part of the domain) or to dev (by changing `staging` to `dev`) at any point in the process.

If you are going to use any other environment, please remember to change the `environment` variable in the next section.

The steps to obtain the token are detailed in this guide: [API tokens](https://ebi-ait.github.io/hca-ebi-dev-team/operations_tasks/api_token.html)

In [11]:
token = "Bearer <paste_token_here>"
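
A small sanity check can catch a token that was pasted without the `Bearer ` prefix, or a placeholder that was left in place. This is only a sketch: the helper name `looks_like_bearer_token` is hypothetical and not part of `hca-ingest`, and it only inspects the shape of the string, it does not verify the token against the service.

```python
def looks_like_bearer_token(value: str) -> bool:
    """Return True if `value` has the shape 'Bearer <something>' with no placeholder left in."""
    prefix = "Bearer "
    if not value.startswith(prefix):
        return False
    body = value[len(prefix):].strip()
    # Reject an empty body or a '<...>' placeholder such as '<paste_token_here>'
    return bool(body) and not (body.startswith("<") and body.endswith(">"))


print(looks_like_bearer_token("Bearer <paste_token_here>"))
print(looks_like_bearer_token("Bearer abc.def.ghi"))
```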

### Set up environment and global variables

In [12]:
# Environment-related set-up and global variables used across the notebook
accepted_environments = {
    'develop': '.dev',
    'staging': '.staging',
    'production': ''
}

environment = 'staging'  #staging environment by default

# Set up environment value for API's URL
try:
  env_for_url = accepted_environments[environment]
except KeyError:
  print(f"Environment {environment} not recognised. Defaulting to staging")
  env_for_url = accepted_environments['staging']

base_url = f'https://api.ingest{env_for_url}.archive.data.humancellatlas.org'

# Set up API object
api = IngestApi(url=base_url)
headers = api.set_token(token=token)
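
The cell above falls back to staging when the environment name is not recognised. If you would rather fail loudly (for example, to avoid sending production data to staging by mistake), the URL construction could be sketched as a function. The name `build_base_url` is hypothetical; the mapping and URL pattern are the ones used in this notebook.

```python
ACCEPTED_ENVIRONMENTS = {
    "develop": ".dev",
    "staging": ".staging",
    "production": "",
}


def build_base_url(environment: str) -> str:
    """Build the Ingest API URL for an environment, raising on unknown names."""
    try:
        env_for_url = ACCEPTED_ENVIRONMENTS[environment]
    except KeyError:
        valid = ", ".join(sorted(ACCEPTED_ENVIRONMENTS))
        raise ValueError(
            f"Environment {environment!r} not recognised; expected one of: {valid}"
        ) from None
    return f"https://api.ingest{env_for_url}.archive.data.humancellatlas.org"


for name in ACCEPTED_ENVIRONMENTS:
    print(name, build_base_url(name))
```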


## Create a project

This block of code will be dedicated to creating a project within ingest. The following will be assumed:
* A JSON entity is available for use as the "content"

For the purpose of this notebook, everything will be performed in the staging environment. To perform this on other environments (e.g. prod), please update the `environment` variable to any of the keys accepted in `accepted_environments`.

In [13]:
# Load the project metadata entity
with open('example_project.json', 'r') as f:
  project_content = json.load(f)

ingest_project = api.create_project(submission_url='', content=project_content)
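
Before sending the content, it can help to confirm that the loaded JSON has the top-level keys you expect. The sketch below is an assumption-laden example: the key list is taken from the example project shown in this notebook rather than from an authoritative reading of the project schema, and `missing_top_level_keys` is a hypothetical helper.

```python
import json

# Assumption: these keys come from the example project in this notebook,
# not from the project schema itself.
EXPECTED_KEYS = ("describedBy", "schema_type", "project_core")


def missing_top_level_keys(content: dict) -> list:
    """Return the expected top-level keys that are absent from `content`."""
    return [key for key in EXPECTED_KEYS if key not in content]


sample = json.loads(
    '{"schema_type": "project", "project_core": {"project_short_name": "myCoolLabel"}}'
)
print(missing_top_level_keys(sample))
```

An empty list means all of the expected keys are present; schema validation itself is still performed by Ingest after the project is created.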


The returned object is the project as contained by ingest: this object contains the metadata that was submitted in the previous step, but also contains some extra, important metadata:

* uuid: Unique identifier for your project, generated randomly
* Management metadata: Metadata that applies to your experiment as a whole, e.g. organs, species used, etc.

We're going to print the object and take a look.

In [14]:
ingest_project

{'content': {'describedBy': 'https://schema.staging.data.humancellatlas.org/type/project/17.0.0/project',
  'schema_type': 'project',
  'project_core': {'project_short_name': 'myCoolLabel',
   'project_title': 'Test_project_with_minimum_information',
   'project_description': 'This is a test project with minimum information for the programmatic submissions guide'},
  'contributors': [{'name': 'Enrique,,Ventura',
    'email': 'enrique@ebi.ac.uk',
    'institution': 'EMBL-EBI',
    'corresponding_contributor': True,
    'project_role': {'text': 'data curator',
     'ontology': 'EFO:0009737',
     'ontology_label': 'data curator'}}],
  'publications': [{'authors': ['Lorem IP', 'Sed UP'],
    'title': 'A combined approach for single-cell mRNA and intracellular protein expression analysis',
    'url': 'https://www.frontiersin.org/articles/10.3389/fcell.2020.00384/full',
    'official_hca_publication': False}],
  'funders': [{'grant_title': 'a cool grant',
    'grant_id': '000000000bp1',
   

Everything looks correct, so we will save the identifier for our project (called the `uuid`) and store it in case we need to retrieve the project later.

In [15]:
# Store project uuid
ingest_project_uuid = ingest_project['uuid']['uuid']
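
If the identifier is going to be stored and reused later, a quick format check with Python's standard `uuid` module can catch copy-and-paste mistakes. This is a minimal sketch; `is_valid_uuid` is a hypothetical helper, and it only checks the format, not whether the project exists in Ingest.

```python
import uuid


def is_valid_uuid(value: str) -> bool:
    """Return True if `value` parses as a UUID in canonical hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


print(is_valid_uuid("019b3b05-903b-4b85-bdae-1e10589ccd06"))
print(is_valid_uuid("not-a-uuid"))
```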

### Understanding the information on the project

After printing the resulting `ingest_project`, you have probably noticed that there is much more metadata than what was sent; for most entities, this is just system-generated and you don't need to worry about it.

However, for `project` metadata, we load some information regarding statuses and general-level metadata for different purposes (e.g. display in the [project catalogue](https://www.ebi.ac.uk/humancellatlas/project-catalogue/)).

This project-level metadata is explained in more detail in the [create a project](https://ebi-ait.github.io/ingest-programmatic-submissions/docs/create_a_project/create_a_project.html) guideline associated with this notebook. For now, we will focus on the metadata that we should fill out:

In [16]:
minimum_required_fields = {
    'releaseDate': None,          # Date that you want your data to be released. If the data is to be released as soon as possible, or if data has already been released (e.g. in GEO) input today's date in format: YYYY-MM-DDT00:00:00Z (e.g. 2021-11-29T00:00:00Z)
    'accessionDate': None,        # Same as above, but for accessioning in public archives.
    'technology': None,           # Library preparation technology(ies) used in the experiment, ontologised. More below.
    'organ': None,                # Organ(s) used in the experiment, ontologised. More below.
    'cellCount': None,            # Estimated number of cells generated by this project.
    'dataAccess': None,           # Type of data access, selected from a list of terms. For more detail, refer to readme.
    'identifyingOrganisms': None, # Organism that was used to generate the data, can be: Human, Mouse, or both.
    'primaryWrangler': None,      # Person that is in charge of the project/submission: associated with a user.
    'wranglingState': None,       # Status of the project. For a detailed list of accepted values, refer to readme.
    'wranglingPriority': None,    # 1, 2, or 3. 1 is highest priority and 3 is lowest. Refer to readme for more information.
    'wranglingNotes': None,       # Extra notes associated with the project; feel free to input your own notes here.
    'isInCatalogue': None,        # If the project is to be displayed in the catalogue, True, otherwise False
    }
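
Since every field starts as `None`, a small check can report which ones still need a value before the project is updated. This is a sketch with a hypothetical helper name, `unfilled_fields`; it compares against `None` explicitly so that legitimate values such as `False` or `0` are not reported.

```python
def unfilled_fields(fields: dict) -> list:
    """Return the names of fields whose value is still None, in insertion order."""
    return [name for name, value in fields.items() if value is None]


example_fields = {
    "releaseDate": "2022-08-30T00:00:00Z",
    "cellCount": 17500,
    "organ": None,
    "isInCatalogue": False,   # False is a real value, so it must not count as unfilled
    "wranglingNotes": None,
}
print(unfilled_fields(example_fields))
```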

### Adding minimum information
Now we will fill in the fields listed above, to make sure we enter the minimum amount of metadata that the project should contain. We're going to divide the fields into two types:
* **Ontologised**: Fields that are validated against the [HCA ontology](https://ontology.archive.data.humancellatlas.org/index).
* **Other**: Fields that are not ontologised and that are validated against other rules.

We're going to start with the ontologised fields.

#### Ontologised fields

These terms are called "ontologised" because they are validated against a set of restrictions defined in our validation rules and enforced in the ontologies themselves. For example, `organ` checks that the input term is a child term of `anatomical structure` ([UBERON:0000061](https://ontology.archive.data.humancellatlas.org/ontologies/hcao/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000061)), related to it only through `subclassOf`. Detailed information on the restrictions can be found in the readme file.

In this category, we have 2 fields:
- organ: A list of the organs used in this experiment; for this notebook, we're going to use the terms "lung"([UBERON:0002048](https://ontology.archive.data.humancellatlas.org/ontologies/hcao/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0002048)) and "heart"([UBERON:0000948](https://ontology.archive.data.humancellatlas.org/ontologies/hcao/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000948)).
- technology: A list of the library preparation technologies used in this experiment; for this notebook, we're going to use the terms `10x 3' v2`([EFO:0009899](https://ontology.archive.data.humancellatlas.org/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0009899)) and `10x 3' v3`([EFO:0009922](https://ontology.archive.data.humancellatlas.org/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0009922)). This field also accepts free text in case there is no ontology term for the technology just yet; we are also going to add an entry for this.


In [17]:
# Set up organ
organ = {
    "ontologies": [
      {
        "text": "lung",                 # Text field, free string that allows the user to introduce a more exact definition of the term if not available in the ontology
        "ontology": "UBERON:0002048",   # Unique identifier for the ontology term, in the form of <ontology>:<ID>
        "ontology_label": "lung"        # Text field, must exactly match the label provided in the ontology, case sensitive.
      },
      {
        "text": "heart",
        "ontology": "UBERON:0000948",
        "ontology_label": "heart"
      }
    ]
}

# Set up technology
technology = {
    "ontologies": [
      {
        "text": "10x 3' v2",
        "ontology": "EFO:0009899",
        "ontology_label": "10x 3' v2"
      },
      {
        "text": "10x 3' v3",
        "ontology": "EFO:0009922",
        "ontology_label": "10x 3' v3"
      }
    ],
    "others": [
        "Mysupercoollibrarypreptechnology"  # Free text field to introduce as many terms as you want that couldn't be found in the ontology
    ]
}

# pass the values to our variable
minimum_required_fields['organ'] = organ
minimum_required_fields['technology'] = technology
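
When a project has many terms, writing each dictionary by hand gets repetitive. The entries can also be assembled with a small builder, sketched below. `ontology_term` and `CURIE_PATTERN` are hypothetical names, and the pattern is an assumption based on the identifiers used in this notebook; it only checks the `<ontology>:<ID>` shape and does not look the term up in the ontology.

```python
import re

# Assumption: identifiers look like '<PREFIX>:<digits>', as in the examples above
CURIE_PATTERN = re.compile(r"^[A-Za-z]+:\d+$")


def ontology_term(text: str, ontology: str, ontology_label: str) -> dict:
    """Build one entry for an `ontologies` list, rejecting malformed identifiers."""
    if not CURIE_PATTERN.match(ontology):
        raise ValueError(f"{ontology!r} does not look like <ontology>:<ID>")
    return {"text": text, "ontology": ontology, "ontology_label": ontology_label}


organ_example = {
    "ontologies": [
        ontology_term("lung", "UBERON:0002048", "lung"),
        ontology_term("heart", "UBERON:0000948", "heart"),
    ]
}
print(organ_example)
```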

#### Other fields

In [18]:
# Dates
# Dates must follow the following format: YYYY-MM-DDThh:mm:ssZ
minimum_required_fields['releaseDate'] = "2022-08-30T00:00:00Z"
minimum_required_fields['accessionDate'] = "2022-08-30T00:00:00Z"

# Enum values
# Set of values accepted are predetermined, depending on the field. 
# For the full list of values, please refer to the readme
minimum_required_fields['dataAccess'] = {
                                          "type": "All fully open",
                                          "notes": "Can be released publicly! :D"
                                        }
minimum_required_fields['identifyingOrganisms'] = ["Human", "Mouse", "Other"]
minimum_required_fields['wranglingPriority'] = 1 # Very important project!
minimum_required_fields['wranglingState'] = "Eligible"

# Simple values
# Set of fields that have a simple value; it may be a free string, an integer or a boolean
minimum_required_fields['cellCount'] = 17500
minimum_required_fields['primaryWrangler'] = ingest_project['user'] # User ID is required in this field.
minimum_required_fields['wranglingNotes'] = "This is an awesome project and I will finish it soon"
minimum_required_fields['isInCatalogue'] = True # We want the project to be displayed in the project catalogue
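
The date strings above were typed by hand. If you prefer to generate them (for example, to use today's date as suggested earlier), the `YYYY-MM-DDThh:mm:ssZ` format can be produced with the standard `datetime` module. This is a minimal sketch; `to_ingest_date` is a hypothetical helper name.

```python
from datetime import date, datetime, timezone


def to_ingest_date(day: date) -> str:
    """Format a calendar date as midnight UTC in the form YYYY-MM-DDThh:mm:ssZ."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return midnight.strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_ingest_date(date(2022, 8, 30)))   # the date used in this notebook
print(to_ingest_date(date.today()))        # today's date, for immediate release
```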

## Updating project with missing information

Now that we understand the metadata that we are handling, and that we have filled in the missing bits necessary for a minimum information project, we will update the project with the values that we have been gathering.

Once we have the content that we have to update, the update itself is pretty easy!

In [19]:
# Retrieve project URL to update
ingest_project_url = ingest_project['_links']['self']['href']
response = api.patch(url=ingest_project_url, json=minimum_required_fields)

updated_ingest_project = response.json()

Let's print the project and check if the changes have made it through!

In [20]:
updated_ingest_project

{'content': {'describedBy': 'https://schema.staging.data.humancellatlas.org/type/project/17.0.0/project',
  'schema_type': 'project',
  'project_core': {'project_short_name': 'myCoolLabel',
   'project_title': 'Test_project_with_minimum_information',
   'project_description': 'This is a test project with minimum information for the programmatic submissions guide'},
  'contributors': [{'name': 'Enrique,,Ventura',
    'email': 'enrique@ebi.ac.uk',
    'institution': 'EMBL-EBI',
    'corresponding_contributor': True,
    'project_role': {'text': 'data curator',
     'ontology': 'EFO:0009737',
     'ontology_label': 'data curator'}}],
  'publications': [{'authors': ['Lorem IP', 'Sed UP'],
    'title': 'A combined approach for single-cell mRNA and intracellular protein expression analysis',
    'url': 'https://www.frontiersin.org/articles/10.3389/fcell.2020.00384/full',
    'official_hca_publication': False}],
  'funders': [{'grant_title': 'a cool grant',
    'grant_id': '000000000bp1',
   

And we have our project, updated, with the minimum required metadata!
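
Reading through the printed project works for a handful of fields, but the comparison can also be done programmatically. The sketch below compares what was sent with what came back; `fields_not_applied` is a hypothetical helper, and it assumes the service echoes values back unchanged, which may not hold if a value is normalised on the server side.

```python
def fields_not_applied(sent: dict, returned: dict) -> dict:
    """Map each sent field whose returned value differs to a (sent, returned) pair."""
    return {
        key: (value, returned.get(key))
        for key, value in sent.items()
        if returned.get(key) != value
    }


# Made-up example data standing in for the request body and the response
sent = {"cellCount": 17500, "wranglingPriority": 1, "isInCatalogue": True}
returned = {"cellCount": 17500, "wranglingPriority": 2, "type": "Project"}
print(fields_not_applied(sent, returned))
```

With the notebook's own variables, this would be called as `fields_not_applied(minimum_required_fields, updated_ingest_project)`; an empty dictionary means every field came back as sent.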

## Retrieve a project

Once we have created a project with minimum information, we may want to retrieve the project to do further things with it (add more metadata, check its status, etc.). In order to do this, we are going to use one of the many functions that we have available to retrieve a project:
- `IngestApi.get_project_by_uuid`: Retrieves a single project by its UUID

But there are other functions available, in case you don't have the UUID at hand or can't remember it, listed below:

<details>
<summary>Functions to search for projects</summary>
<ul>
<li>.get_user_projects: Retrieve all the projects associated with your user (Requires token to be set)</li>
<li>.get_project_by_id: Retrieve a project with the MongoDB ID provided</li>
</ul>
</details>


In [21]:
ingest_project = api.get_project_by_uuid(ingest_project_uuid)

Let's ensure we have retrieved our project correctly:

In [22]:
ingest_project

{'content': {'describedBy': 'https://schema.staging.data.humancellatlas.org/type/project/17.0.0/project',
  'schema_type': 'project',
  'project_core': {'project_short_name': 'myCoolLabel',
   'project_title': 'Test_project_with_minimum_information',
   'project_description': 'This is a test project with minimum information for the programmatic submissions guide'},
  'contributors': [{'name': 'Enrique,,Ventura',
    'email': 'enrique@ebi.ac.uk',
    'institution': 'EMBL-EBI',
    'corresponding_contributor': True,
    'project_role': {'text': 'data curator',
     'ontology': 'EFO:0009737',
     'ontology_label': 'data curator'}}],
  'publications': [{'authors': ['Lorem IP', 'Sed UP'],
    'title': 'A combined approach for single-cell mRNA and intracellular protein expression analysis',
    'url': 'https://www.frontiersin.org/articles/10.3389/fcell.2020.00384/full',
    'official_hca_publication': False}],
  'funders': [{'grant_title': 'a cool grant',
    'grant_id': '000000000bp1',
   

### Check status of a project

When a project (or any other piece of metadata) is uploaded to or updated in Ingest, it gets validated: the `content` is validated against the schema it points to (in the `describedBy` field) and, in the case of a project, the base fields are validated against a separate set of rules.

The Ingest service can provide a full report of these validation events, including the status of the entity and the error messages.

In this section, we will first retrieve the errors (currently none) of the project we just uploaded. We will then update the project to artificially produce a few errors, retrieve it again and check them. For a detailed explanation of each type of error, please refer to the readme file.

In [23]:
# Print validation errors
validation_errors = ingest_project['validationErrors']
print(f"Validation errors: {validation_errors if validation_errors else None}")

# Print validation status
validation_status = ingest_project['validationState']
print(f"Validation status: {validation_status}")

Validation errors: None
Validation status: Valid


In [24]:
non_valid_content = ingest_project['content']
non_valid_content['estimated_cell_count'] = '17500'             # Cell count should always be an integer
non_valid_content['insdc_project_accessions'] =  [              # INSDC project accessions:
                                      'GSE7777777',   # SHOULD NOT be a GEO series accession
                                      'SRP000000',    # SHOULD follow SRPXXXXXX format
                                      '',             # SHOULD NOT be an empty string
                                      347289347       # SHOULD NOT be a number
                                      ]

non_valid_values =  {    
                      "content" : non_valid_content   # Patching "content" field
                    } 

# Patch the non_valid content into the project content
response = api.patch(url=ingest_project_url, json=non_valid_values)

In [25]:
response.json()

{'content': {'describedBy': 'https://schema.staging.data.humancellatlas.org/type/project/17.0.0/project',
  'schema_type': 'project',
  'project_core': {'project_short_name': 'myCoolLabel',
   'project_title': 'Test_project_with_minimum_information',
   'project_description': 'This is a test project with minimum information for the programmatic submissions guide'},
  'contributors': [{'name': 'Enrique,,Ventura',
    'email': 'enrique@ebi.ac.uk',
    'institution': 'EMBL-EBI',
    'corresponding_contributor': True,
    'project_role': {'text': 'data curator',
     'ontology': 'EFO:0009737',
     'ontology_label': 'data curator'}}],
  'publications': [{'authors': ['Lorem IP', 'Sed UP'],
    'title': 'A combined approach for single-cell mRNA and intracellular protein expression analysis',
    'url': 'https://www.frontiersin.org/articles/10.3389/fcell.2020.00384/full',
    'official_hca_publication': False}],
  'funders': [{'grant_title': 'a cool grant',
    'grant_id': '000000000bp1',
   

After patching the project with invalid values, let's repeat the check we did previously.

In [26]:
ingest_project = api.get_project_by_uuid(ingest_project_uuid)
# Print validation errors
validation_errors = ingest_project['validationErrors']
newline = '\n'
print(f"Validation errors: {validation_errors if validation_errors else None}")

# Print validation status
validation_status = ingest_project['validationState']
print(f"Validation status: {validation_status}")

Validation errors: [{'errorType': 'METADATA_ERROR', 'message': 'should match pattern "^[D|E|S]RP[0-9]+$"', 'userFriendlyMessage': 'should match pattern "^[D|E|S]RP[0-9]+$" at .insdc_project_accessions[0]', 'absoluteDataPath': '.insdc_project_accessions[0]'}, {'errorType': 'METADATA_ERROR', 'message': 'should match pattern "^[D|E|S]RP[0-9]+$"', 'userFriendlyMessage': 'should match pattern "^[D|E|S]RP[0-9]+$" at .insdc_project_accessions[2]', 'absoluteDataPath': '.insdc_project_accessions[2]'}, {'errorType': 'METADATA_ERROR', 'message': 'should be string', 'userFriendlyMessage': 'should be string at .insdc_project_accessions[3]', 'absoluteDataPath': '.insdc_project_accessions[3]'}, {'errorType': 'METADATA_ERROR', 'message': 'should be integer', 'userFriendlyMessage': 'should be integer at .estimated_cell_count', 'absoluteDataPath': '.estimated_cell_count'}]
Validation status: Invalid


As we can see, this time it has returned two things:
- A list of errors, each detailing its type, message and location.
- Validation status `Invalid`, indicating that validation failed.

For detailed information on how to understand the errors, please proceed to the "readme.md" file.
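
The raw list of errors is hard to read when printed on a single line. A short sketch that condenses each error into a `path: message` string is shown below. `summarise_errors` is a hypothetical helper; the keys it reads (`absoluteDataPath`, `message`) are the ones visible in the output above, and the sample data is copied from that output.

```python
def summarise_errors(validation_errors: list) -> list:
    """Turn Ingest validation error dicts into short 'path: message' strings."""
    return [
        f"{error.get('absoluteDataPath', '?')}: {error.get('message', '')}"
        for error in (validation_errors or [])
    ]


sample_errors = [
    {
        "errorType": "METADATA_ERROR",
        "message": "should be string",
        "userFriendlyMessage": "should be string at .insdc_project_accessions[3]",
        "absoluteDataPath": ".insdc_project_accessions[3]",
    },
    {
        "errorType": "METADATA_ERROR",
        "message": "should be integer",
        "userFriendlyMessage": "should be integer at .estimated_cell_count",
        "absoluteDataPath": ".estimated_cell_count",
    },
]
for line in summarise_errors(sample_errors):
    print(line)
```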

## Delete a project

Projects in our database can be deleted. We do not advise deleting projects once they have been published in the data portal (`uuid` identifiers are important for updates). However, at any point before the submission is finished (covered later in the notebook), any metadata entity can be deleted, including projects.

In [27]:
# Delete ingest project and check everything went correctly
response = api.delete(ingest_project_url)

assert response.status_code == 204

If the status code of the response is 204, the project has been deleted!

# Addendum

## Updating projects

### Deleting a field/Replacing all values

Deleting a field requires a slightly different sort of operation; up until now, we have used `patch` to address field modifications. However, if we want to delete a field or replace all values, we would need to delete the field from the content, and then PUT the whole content of the project entity to the project URL.

This operation will completely replace the older entry with the new one; using the old one as a template ensures critical fields (e.g. `uuid`) get preserved over this operation.
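
The preparation step for such a replacement can be sketched as follows. `without_content_field` is a hypothetical helper: it copies the retrieved project, so that fields such as `uuid` are carried over, and removes one field from `content`. The final PUT call is only shown as a comment, and the use of `requests.put` with the `headers` obtained earlier is an assumption rather than a documented `hca-ingest` workflow.

```python
import copy


def without_content_field(project: dict, field: str) -> dict:
    """Return a deep copy of `project` with `field` removed from its `content`."""
    replacement = copy.deepcopy(project)   # leave the retrieved project untouched
    replacement.get("content", {}).pop(field, None)
    return replacement


# Made-up, trimmed-down project standing in for the one retrieved from Ingest
project = {
    "uuid": {"uuid": "019b3b05-903b-4b85-bdae-1e10589ccd06"},
    "content": {"schema_type": "project", "insdc_project_accessions": ["SRP000000"]},
}
body = without_content_field(project, "insdc_project_accessions")
print(body)

# Assumption: the body would then be sent with an HTTP PUT to the project URL, e.g.
# requests.put(ingest_project_url, json=body, headers=headers)
```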

### Adding new fields

When adding new fields, it is essential to consider the type of field being added: nested properties and arrays can't simply be modified through a `patch` operation; the document needs to be partially (or entirely) replaced.

In this notebook, we are going to add 2 fields:
- A completely new field, available in the schema, `insdc_project_accessions`
- A new publication that we want to associate with this project, without deleting the existing one.
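
For the second case, since an array can't be extended in place, one approach is to copy the existing `content`, append to its `publications` list, and send the whole `content` back. A minimal sketch of the preparation step follows; `content_with_publication` is a hypothetical helper and the new publication's values are made up for illustration.

```python
import copy


def content_with_publication(content: dict, publication: dict) -> dict:
    """Return a copy of `content` whose `publications` list ends with `publication`."""
    updated = copy.deepcopy(content)
    updated.setdefault("publications", []).append(publication)
    return updated


content = {
    "schema_type": "project",
    "publications": [
        {
            "authors": ["Lorem IP", "Sed UP"],
            "title": "A combined approach for single-cell mRNA and intracellular protein expression analysis",
        }
    ],
}
# Made-up example values for the publication being added
new_publication = {"authors": ["Doe J"], "title": "A hypothetical second publication"}

patch_body = {"content": content_with_publication(content, new_publication)}
print(len(patch_body["content"]["publications"]))
```

The resulting `patch_body` has the same `{"content": ...}` shape used earlier in this notebook when patching the project content.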

In [None]:
# Adding the INSDC project accession
response = api.patch(url=ingest_project_url, json={"content": {"insdc_project_accessions": ["SRP000000"]}})
assert response.status_code == 200
updated_project = api.get_project_by_uuid(ingest_project_uuid)

# Let's print the project and ensure the modification has gone through!
updated_project

{'content': {'insdc_project_accessions': ['SRP000000']},
 'submissionDate': '2022-08-31T14:05:37.625Z',
 'updateDate': '2022-08-31T14:07:18.255Z',
 'user': '5ece3464ec0680746267e784',
 'lastModifiedUser': '5ece3464ec0680746267e784',
 'type': 'Project',
 'uuid': {'uuid': '019b3b05-903b-4b85-bdae-1e10589ccd06'},
 'events': [],
 'firstDcpVersion': '2022-08-31T14:05:37.625Z',
 'dcpVersion': '2022-08-31T14:07:18.252Z',
 'contentLastModified': '2022-08-31T14:07:18.252Z',
 'accession': None,
 'validationState': 'Draft',
 'validationErrors': [],
 'graphValidationErrors': None,
 'isUpdate': False,
 'releaseDate': '2022-08-30T00:00:00Z',
 'accessionDate': '2022-08-30T00:00:00Z',
 'technology': {'ontologies': [{'text': "10x 3' v2",
    'ontology': 'EFO:0009899',
    'ontology_label': "10x 3' v2"},
   {'text': "10x 3' v3",
    'ontology': 'EFO:0009922',
    'ontology_label': "10x 3' v3"}],
  'others': ['Mysupercoollibrarypreptechnology']},
 'organ': {'ontologies': [{'text': 'lung',
    'ontology':