# Getting started with EpiGraphDB in Python

This notebook is provided as a brief introductory guide to working with the EpiGraphDB platform through Python. Here we will demonstrate a few basic operations that can be carried out using the platform, but for more advanced methods please refer to the [API endpoint documentation](http://docs.epigraphdb.org/api/api-endpoints/).

A Python wrapper for EpiGraphDB's API is currently in the works, but for now we will query the API directly using the `requests` library. Familiarity with this package is helpful but not essential.

In [1]:
import requests

First, we will ping the API to check our connection:

In [2]:
# Store our API URL as a string for future use
API_URL = "https://api.epigraphdb.org"

# Here we use the .get() method to send a GET request to the /ping endpoint of the API
endpoint = '/ping'
response_object = requests.get(API_URL + endpoint)  

# Check that the ping was successful
response_object.raise_for_status()
print("If this line gets printed, ping was successful.")

If this line gets printed, ping was successful.


***
## 1. Using EpiGraphDB to obtain biological mappings

In this first section, we will take an arbitrary list of genes and query the EpiGraphDB API to find the proteins that they map to. We will use the `POST` HTTP method, for which this API expects its parameters in JSON format, a conversion that is easy to do with the `json` library. To find the correct names of the parameters that we are about to set, we can navigate to the [EpiGraphDB API documentation](http://docs.epigraphdb.org/api/api-endpoints/) and find the endpoint of interest. From there we read off the parameters that we want to pass, and can refer to the example request if needed.

In [3]:
# 1.1 Mapping genes to proteins

# Set parameters and convert to JSON format
import json
params = {
  "gene_name_list": [
    "TP53",
    "BRCA1", 
    "TNF"
  ]
}
json_params = json.dumps(params)

# Define which endpoint of the API we would like to connect with
endpoint = '/mappings/gene-to-protein'

# Send the POST request
response_object = requests.post(API_URL + endpoint, data=json_params)

# Check for successful request
response_object.raise_for_status()

# Store results in a pandas dataframe
import pandas as pd
results = response_object.json()['results']
gene_protein_df = pd.json_normalize(results)

gene_protein_df.head()

Unnamed: 0,gene.name,gene.ensembl_id,protein.uniprot_id
0,TP53,ENSG00000141510,P04637
1,BRCA1,ENSG00000012048,P38398
2,TNF,ENSG00000232810,P01375


In the above cell, we queried EpiGraphDB for the proteins that have been mapped to the genes *TP53*, *BRCA1*, and *TNF*. Our query went through successfully and we received an associated protein for each. The columns in our output dataframe take the general form `entity.property` and this will remain consistent throughout this notebook. 
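To make the `entity.property` naming concrete, the sketch below flattens one nested record by hand using only the standard library. The record mirrors the first row of the table above. `flatten_record` is a hypothetical helper written for illustration, and it only approximates what `pd.json_normalize` does for simple nested dictionaries.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the nested entity, carrying its name as a prefix
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# One record shaped like the first row of the gene-to-protein table
record = {
    "gene": {"name": "TP53", "ensembl_id": "ENSG00000141510"},
    "protein": {"uniprot_id": "P04637"},
}

flat_record = flatten_record(record)
print(flat_record)
```

Each top-level key (`gene`, `protein`) becomes the prefix of a column name, which is why the dataframe columns read `gene.name`, `gene.ensembl_id`, and `protein.uniprot_id`.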

Specific descriptions for the properties of each entity can be found in EpiGraphDB's [data dictionary](https://docs.epigraphdb.org/graph-database/meta-nodes/). Click on the relevant entity in the table of contents on the right-hand side (or scroll down to the relevant section), then locate the property of interest.

In [4]:
# 1.2 Proteins to pathways

# As above, this is another POST request, so we need our data in JSON format
json_params = json.dumps({
  "uniprot_id_list": list(gene_protein_df['protein.uniprot_id'].values)
})

# Send the request
endpoint = '/protein/in-pathway'
response_object = requests.post(API_URL + endpoint, data=json_params)

# Check for successful request
response_object.raise_for_status()

# Store results
results = response_object.json()['results']
protein_pathway_df = pd.json_normalize(results)

protein_pathway_df.head()

Unnamed: 0,uniprot_id,pathway_count,pathway_reactome_id
0,P04637,5,"[R-HSA-6785807, R-HSA-390471, R-HSA-5689896, R..."
1,P38398,6,"[R-HSA-6796648, R-HSA-1221632, R-HSA-8953750, ..."
2,P01375,3,"[R-HSA-6785807, R-HSA-6783783, R-HSA-5357905]"


Above, we took the proteins that had been mapped to our genes of interest and queried the platform for their associated pathway data. The API found multiple pathways for each protein and returned the respective Reactome IDs to us as lists.
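Because the pathway IDs arrive as one list per protein, a common next step is to invert the mapping and ask which pathways are shared. The following is a minimal standard-library sketch of that idea. The IDs are copied from the output above; the lists for P04637 and P38398 are truncated in that output, so only the visible IDs are included here and the result is illustrative rather than complete.

```python
from collections import defaultdict

# Reactome IDs copied from the output above. The lists for P04637 and
# P38398 are truncated there, so only the visible IDs appear here.
protein_pathways = {
    "P04637": ["R-HSA-6785807", "R-HSA-390471", "R-HSA-5689896"],
    "P38398": ["R-HSA-6796648", "R-HSA-1221632", "R-HSA-8953750"],
    "P01375": ["R-HSA-6785807", "R-HSA-6783783", "R-HSA-5357905"],
}

# Invert the mapping: pathway ID -> proteins found in that pathway
pathway_to_proteins = defaultdict(list)
for uniprot_id, pathway_ids in protein_pathways.items():
    for pathway_id in pathway_ids:
        pathway_to_proteins[pathway_id].append(uniprot_id)

# Keep only pathways that contain more than one of our proteins
shared_pathways = {
    pathway_id: proteins
    for pathway_id, proteins in pathway_to_proteins.items()
    if len(proteins) > 1
}
print(shared_pathways)
```

With the full lists from `protein_pathway_df`, the same loop would work unchanged by iterating over the `uniprot_id` and `pathway_reactome_id` columns.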


It is worth noting here that so far we have only been accessing the `'results'` key in the nested dictionary returned by the `.json()` method of our response object. The other available key is `'metadata'` (see the output below), which provides information about the request itself, including the specific Cypher query that the platform ran to get these results. If you would like to know more about the use of Cypher in these requests, there is a section dedicated to this at the end of this notebook.

In [5]:
from pprint import pprint
metadata = response_object.json()['metadata']

pprint(metadata)

{'empty_results': False,
 'query': 'MATCH p=(protein:Protein)-[r:PROTEIN_IN_PATHWAY]-(pathway:Pathway) '
          "WHERE protein.uniprot_id IN ['P04637', 'P38398', 'P01375'] RETURN "
          'protein.uniprot_id AS uniprot_id, count(p) AS pathway_count, '
          'collect(pathway.reactome_id) AS pathway_reactome_id',
 'total_seconds': 0.005797}
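The `empty_results` flag in the metadata can be used to guard later processing. The sketch below shows one way to do that on an already-decoded response. `extract_results` is a hypothetical helper, not part of EpiGraphDB, and the sample payloads are hand-built from the outputs shown above.

```python
def extract_results(payload):
    """Return the 'results' list, or an empty list if the query matched nothing."""
    metadata = payload.get("metadata", {})
    if metadata.get("empty_results", False):
        return []
    return payload.get("results", [])


# Hand-built payloads shaped like response_object.json()
payload_with_rows = {
    "metadata": {"empty_results": False, "total_seconds": 0.005797},
    "results": [
        {
            "uniprot_id": "P01375",
            "pathway_count": 3,
            "pathway_reactome_id": ["R-HSA-6785807", "R-HSA-6783783", "R-HSA-5357905"],
        }
    ],
}
payload_without_rows = {"metadata": {"empty_results": True}, "results": []}

rows = extract_results(payload_with_rows)
no_rows = extract_results(payload_without_rows)
print(len(rows), len(no_rows))
```

In the notebook, the same helper could be called as `extract_results(response_object.json())` before passing the rows to `pd.json_normalize`.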


***
## 2. Epidemiological relationship analysis

In the cell below, we will query EpiGraphDB to get metadata relating to GWAS studies of a target trait: body mass index. Following that, we will run queries to get pre-computed Mendelian Randomisation (MR) results involving the same trait.

Here we will be using a different HTTP method than before: the `GET` method, which is easier to use in Python because the parameters can be passed directly as a dictionary. To learn more about the differences between `GET` and `POST`, please see [this guide](https://www.w3schools.com/tags/ref_httpmethods.asp).
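With `GET`, the parameter dictionary ends up URL-encoded in the query string of the request. The sketch below reproduces that encoding with the standard library so the final URL can be inspected without sending anything. It is an illustration of the idea; the URL that `requests` builds should be equivalent, but that is worth confirming via `response_object.url` on a real response.

```python
from urllib.parse import urlencode

API_URL = "https://api.epigraphdb.org"
endpoint = "/meta/nodes/Gwas/search"
params = {"name": "Body mass index"}

# Encode the dictionary as a query string (spaces become '+')
query_string = urlencode(params)
full_url = f"{API_URL}{endpoint}?{query_string}"
print(full_url)
```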

In [6]:
# 2.1 Getting GWAS studies from EpiGraphDB

# Create a dictionary for the parameters to be passed
params = {
    'name':'Body mass index'
}

# Send the request
endpoint = '/meta/nodes/Gwas/search'
response_object = requests.get(API_URL + endpoint, params=params)
response_object.raise_for_status()

# Store the results of the query and display
result = response_object.json()['results']
gwas_df = pd.json_normalize(result)

gwas_df.head()

Unnamed: 0,node.note,node.access,node.year,node.mr,node.author,node.consortium,node.sex,node.priority,node.pmid,node.population,node.unit,node.sample_size,node.nsnp,node.trait,node.id,node.subcategory,node.category,node.sd
0,,public,2018,1,Hoffmann TJ,,,0,30108127,European,,315347,27854527,Body mass index,ebi-a-GCST006368,,,
1,,public,2015,1,Locke AE,,Males and Females,1,25673413,Mixed,,339224,2555511,Body mass index,ieu-a-2,Anthropometric,Risk factor,4.77
2,,public,2015,1,Locke AE,,Males,2,25673413,European,,152893,2477659,Body mass index,ieu-a-785,Anthropometric,Risk factor,4.77
3,,public,2015,1,Locke AE,,Males and Females,3,25673413,European,,322154,2554668,Body mass index,ieu-a-835,Anthropometric,Risk factor,4.77
4,,public,2017,1,Akiyama M,,,0,28892062,East Asian,,158284,5952516,Body mass index,ebi-a-GCST004904,,,


In [7]:
# 2.2 Getting MR results for a trait

# Set parameters
params = {'exposure_trait': 'Body mass index',
          'pval_threshold': 1e-10}

# Send request
endpoint = '/mr'
response_object = requests.get(API_URL + endpoint, params=params)
response_object.raise_for_status()

# Store and display results
result = response_object.json()['results']
BMI_MR_df = pd.json_normalize(result) 

BMI_MR_df.head()

Unnamed: 0,exposure.id,exposure.trait,outcome.id,outcome.trait,mr.b,mr.se,mr.pval,mr.method,mr.selection,mr.moescore
0,ieu-a-2,Body mass index,ukb-a-74,Non-cancer illness code self-reported: diabetes,0.034559,0.002418,0.0,FE IVW,DF,0.93
1,ieu-a-2,Body mass index,ukb-a-388,Hip circumference,0.724105,0.026588,0.0,Simple median,Tophits,0.95
2,ieu-a-2,Body mass index,ukb-a-382,Waist circumference,0.65644,0.024496,0.0,Simple median,Tophits,0.94
3,ieu-a-2,Body mass index,ukb-a-35,Comparative height size at age 10,0.136684,0.007909,0.0,FE IVW,Tophits,0.94
4,ieu-a-2,Body mass index,ukb-a-34,Comparative body size at age 10,0.36558,0.023556,0.0,Simple median,HF,0.87


The dataframe above displays the results of our query. We requested all traits for which an MR analysis using body mass index as the exposure variable returned a causal estimate with a p-value lower than 1e-10. Information regarding the specific MR parameters, as well as the exposure and outcome variables, has been displayed in the table for all traits that matched our search conditions.

In the parameters we set in 2.2, another viable parameter name is `'outcome_trait'` which takes the same type of values as `'exposure_trait'`. Either one or both of these parameters can be passed during an MR query, which allows users to refine which results are returned to them depending on their own analytical preferences.
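A small helper can make the "either one or both" rule explicit when building MR queries. The sketch below is hypothetical: `build_mr_params` is not part of any EpiGraphDB package, its default threshold simply repeats the value used in 2.2, and the outcome trait is taken from the results table above.

```python
def build_mr_params(exposure_trait=None, outcome_trait=None, pval_threshold=1e-10):
    """Build the parameter dictionary for an MR query.

    At least one of exposure_trait and outcome_trait must be given.
    """
    if exposure_trait is None and outcome_trait is None:
        raise ValueError("Provide exposure_trait, outcome_trait, or both.")
    params = {"pval_threshold": pval_threshold}
    if exposure_trait is not None:
        params["exposure_trait"] = exposure_trait
    if outcome_trait is not None:
        params["outcome_trait"] = outcome_trait
    return params


# Exposure only, as in 2.2
exposure_only = build_mr_params(exposure_trait="Body mass index")

# Exposure and outcome together, to narrow the results to one pair of traits
both_traits = build_mr_params(
    exposure_trait="Body mass index",
    outcome_trait="Waist circumference",
)
print(exposure_only)
print(both_traits)
```

Either dictionary can then be passed as `params=` to `requests.get(API_URL + '/mr', ...)` in the same way as in 2.2.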

***
## 3. Looking for literature evidence

Accessing information in the literature is a ubiquitous task in research, be it for novel hypothesis generation or as part of evidence triangulation. EpiGraphDB facilitates fast processing of this information by allowing access to a host of literature-mined relationships that have been structured into semantic triples. These take the general form *(subject, predicate, object)* and have been generated using contemporary natural language processing techniques applied to a large body of published biomedical research papers. In the following section we will query the API for the literature relationship between a given gene and an outcome trait.

In [8]:
# Establish parameters
params = {
    'gene_name': "IL23R",
    'object_name': "Inflammatory bowel disease"
}

# Send the request
endpoint = "/literature/gene"
response_object = requests.get(API_URL + endpoint, params=params)
response_object.raise_for_status()

# Store the results of the query and display
result = response_object.json()['results']
lit_df = pd.json_normalize(result) 

lit_df

Unnamed: 0,pubmed_id,gene.name,st.predicate,st.object_name
0,"[17484863, 21155887]",IL23R,NEG_ASSOCIATED_WITH,Inflammatory Bowel Diseases
1,[27852544],IL23R,AFFECTS,Inflammatory Bowel Diseases
2,"[17484863, 19575361, 19496308, 18383521, 18341...",IL23R,ASSOCIATED_WITH,Inflammatory Bowel Diseases
3,[23131344],IL23R,PREDISPOSES,Inflammatory Bowel Diseases


The dataframe above shows the results of our query: four unique predicates were found between the gene *IL23R* and the trait *Inflammatory bowel disease*, and they are displayed in the `st.predicate` column. The `pubmed_id` column contains the PubMed IDs of the papers from which each triple was derived. These IDs allow us to access the respective papers by navigating to `https://pubmed.ncbi.nlm.nih.gov/*insert_pubmed_id_here*`. In this particular case it seems that *ASSOCIATED_WITH* is the most common predicate linking our gene to the trait, but we can't see exactly how many papers there are because of how pandas displays lists. Let's add a paper count to the dataframe.

In [9]:
counts = [len(papers_list) for papers_list in lit_df['pubmed_id']]
lit_df['publication_count'] = counts

lit_df

Unnamed: 0,pubmed_id,gene.name,st.predicate,st.object_name,publication_count
0,"[17484863, 21155887]",IL23R,NEG_ASSOCIATED_WITH,Inflammatory Bowel Diseases,2
1,[27852544],IL23R,AFFECTS,Inflammatory Bowel Diseases,1
2,"[17484863, 19575361, 19496308, 18383521, 18341...",IL23R,ASSOCIATED_WITH,Inflammatory Bowel Diseases,21
3,[23131344],IL23R,PREDISPOSES,Inflammatory Bowel Diseases,1
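The PubMed IDs in the table can be turned into links using the URL pattern mentioned earlier. The sketch below does this for the three rows whose ID lists are shown in full; the *ASSOCIATED_WITH* row is truncated in the output, so it is left out. `pubmed_urls` is a hypothetical helper, and the IDs are typed here as integers purely for illustration (the f-string works equally well for strings).

```python
PUBMED_BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def pubmed_urls(pubmed_ids):
    """Build one PubMed link per ID."""
    return [f"{PUBMED_BASE_URL}{pmid}" for pmid in pubmed_ids]


# Rows copied from the table above (only those with complete ID lists)
rows = [
    {"st.predicate": "NEG_ASSOCIATED_WITH", "pubmed_id": [17484863, 21155887]},
    {"st.predicate": "AFFECTS", "pubmed_id": [27852544]},
    {"st.predicate": "PREDISPOSES", "pubmed_id": [23131344]},
]

# Map each predicate to the links of its supporting papers
links_by_predicate = {
    row["st.predicate"]: pubmed_urls(row["pubmed_id"]) for row in rows
}
for predicate, links in links_by_predicate.items():
    print(predicate, links)
```

On the real dataframe, the same helper could be applied with `lit_df['pubmed_id'].apply(pubmed_urls)`.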


***
## 4. EpiGraphDB node search

EpiGraphDB stores data as nodes (entities) and edges (relationships) of a wide range of types. The `/meta` endpoints of the API offer us information about the structure of the graph itself. For example, the available classes of nodes can be listed through the `/meta/nodes/list` endpoint. Let's do that now:

In [10]:
# 4.1 Getting a list of available meta-nodes

# Send the request
endpoint = "/meta/nodes/list"
response_object = requests.get(API_URL + endpoint)
response_object.raise_for_status()

# Store the results of the query and display
result = response_object.json()

result

['Gwas',
 'Disease',
 'Drug',
 'Efo',
 'Event',
 'Gene',
 'Tissue',
 'Literature',
 'Pathway',
 'Protein',
 'SemmedTerm',
 'Variant']

The list above corresponds to EpiGraphDB's meta nodes, whose documentation can be found [here](https://docs.epigraphdb.org/graph-database/meta-nodes/) along with their available properties.

In the following, we will demonstrate how we can search by name for a node of interest, using the endpoint `/meta/nodes/{meta_node}/search`, where viable values for `{meta_node}` are those terms listed above.
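Since `{meta_node}` must be one of the names listed above, it can help to validate the name before building the endpoint. The following is a minimal sketch: `search_endpoint` is a hypothetical helper, the list is copied from the output above, and the check is case-sensitive only because that is how the names are listed (whether the API itself tolerates other casing is not tested here).

```python
# Meta-node names copied from the /meta/nodes/list output above
META_NODES = [
    "Gwas", "Disease", "Drug", "Efo", "Event", "Gene",
    "Tissue", "Literature", "Pathway", "Protein", "SemmedTerm", "Variant",
]


def search_endpoint(meta_node):
    """Return the search endpoint for a meta node, rejecting unknown names."""
    if meta_node not in META_NODES:
        raise ValueError(
            f"Unknown meta node {meta_node!r}; expected one of {META_NODES}"
        )
    return f"/meta/nodes/{meta_node}/search"


gwas_endpoint = search_endpoint("Gwas")
print(gwas_endpoint)
```

In practice the list could be fetched once from `/meta/nodes/list` rather than hard-coded, so that it stays in step with the platform.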


In [11]:
# 4.2 Searching for specific entities by name

# Set params 
params = {
    'name': 'breast cancer'
}

# Make request
meta_node = 'Gwas'
endpoint = f"/meta/nodes/{meta_node}/search"
response_object = requests.get(API_URL + endpoint, params=params)
response_object.raise_for_status()

# Convert to pandas
results = pd.json_normalize(response_object.json()['results'])
target_node_id = results['node.id'][3]  # Store one ID for use in the next cell

results[['node.trait', 'node.id', 'node.sample_size', 'node.year', 'node.author']]

Unnamed: 0,node.trait,node.id,node.sample_size,node.year,node.author
0,Breast cancer,ebi-a-GCST007236,89677,2015,Michailidou K
1,Breast cancer,ebi-a-GCST004988,139274,2017,Michailidou K
2,Breast cancer (Combined Oncoarray; iCOGS; GWAS...,ieu-a-1126,228951,2017,Michailidou K
3,Breast cancer (GWAS),ieu-a-1131,32498,2017,Michailidou K
4,Breast cancer (GWAS),ieu-a-1168,33832,2015,Michailidou K
5,Breast cancer (Oncoarray),ieu-a-1129,106776,2017,Michailidou K
6,Breast cancer (Survival),ieu-a-1165,37954,2015,Guo Q
7,Breast cancer (iCOGS),ieu-a-1162,89677,2015,Michailidou K
8,Breast cancer (iCOGS),ieu-a-1130,89677,2017,Michailidou K
9,Breast cancer anti-estrogen resistance protein 3,prot-a-234,3301,2018,Sun BB


Above we used the `name` parameter of the endpoint to search for any GWAS nodes that fuzzily matched our specified string. Several did, and some of their basic node properties are displayed above. Fuzzy matching is useful because you don't need to know the exact name of the entity or its ID in order to look it up. 

On the other hand, once you have identified your entity of interest, it is often sensible to move forward using the node's ID for the sake of unambiguity. Fortunately we can also search for traits using their ID, as demonstrated below.
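When picking an ID out of fuzzy search results, selecting by explicit criteria is less fragile than relying on a row position, because the order of results may change. The sketch below filters a few rows copied from the table above by exact trait name and year. `find_node_ids` is a hypothetical helper, and the value types are written as they appear in the table; the live response may type them differently.

```python
def find_node_ids(rows, trait, year):
    """Return the IDs of rows whose trait and year match exactly."""
    return [
        row["node.id"]
        for row in rows
        if row["node.trait"] == trait and row["node.year"] == year
    ]


# A subset of the search results shown above
rows = [
    {"node.trait": "Breast cancer", "node.id": "ebi-a-GCST004988", "node.year": 2017},
    {"node.trait": "Breast cancer (GWAS)", "node.id": "ieu-a-1131", "node.year": 2017},
    {"node.trait": "Breast cancer (GWAS)", "node.id": "ieu-a-1168", "node.year": 2015},
    {"node.trait": "Breast cancer (Survival)", "node.id": "ieu-a-1165", "node.year": 2015},
]

matching_ids = find_node_ids(rows, trait="Breast cancer (GWAS)", year=2017)
print(matching_ids)
```

With a dataframe, the equivalent selection would be a boolean mask on the `node.trait` and `node.year` columns.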

In [12]:
# 4.3 Searching for a node by ID

# Set params
params = {
    'id': target_node_id  # From previous cell
}

# Make request
meta_node = 'Gwas'
endpoint = f"/meta/nodes/{meta_node}/search"
response_object = requests.get(API_URL + endpoint, params=params)
response_object.raise_for_status()

# Convert to pandas
results = pd.json_normalize(response_object.json()['results'])

results

Unnamed: 0,node.ncase,node.access,node.year,node.mr,node.author,node.consortium,node.sex,node.priority,node.pmid,node.population,node.unit,node.sample_size,node.nsnp,node.ncontrol,node.trait,node.id,node.subcategory,node.category
0,14910,public,2017,1,Michailidou K,,Females,1,29059683,European,,32498,10680257,17588,Breast cancer (GWAS),ieu-a-1131,Cancer,Disease


***
## 5. Advanced examples: Cypher

Until now, to get information from the platform we have simply been creating a dictionary or JSON object containing our parameters and then sending it to the correct endpoint of the API using the `requests` library. This is fine practice, and the API has been designed specifically to allow this method of use, as we have (non-exhaustively) demonstrated above. It works because the API automatically converts the HTTP requests that it receives into a Cypher query, which it then passes to the Neo4j database on which EpiGraphDB is built. The database passes back the result of the query, which is then returned to us in Python as a response object. Each response object contains metadata that includes the exact Cypher query that was run against the database, as shown in the cell below.

In [13]:
# 5.1 Cypher

params = {
  "gene_name_list": [
    "TP53"
  ]
}
json_params = json.dumps(params)
endpoint = '/mappings/gene-to-protein'
response_object = requests.post(API_URL + endpoint, data=json_params)
response_object.raise_for_status()

# Extract and print the Cypher query
cypher_query = response_object.json()['metadata']['query']
print(cypher_query)

MATCH (gene:Gene)-[gp:GENE_TO_PROTEIN]-(protein:Protein) WHERE gene.name IN ['TP53'] RETURN gene {.ensembl_id, .name}, protein {.uniprot_id}


The text printed above is the exact Cypher query that ran behind the scenes for this request. It has the same form as the query run in section 1.1, restricted here to the single gene *TP53*. The basic structure of these queries is as follows:

    MATCH subgraph
    WHERE condition
    RETURN data

Note that the subgraph should take the general form *(node)-[relationship]-(node)*. Both nodes and relationships are written as `my_variable_name:Meta_node`, so that we can access their properties through the variable name we assigned them (`my_variable_name`) and use those properties to define our conditions and the data we want returned. Information on the available properties for each class of entity can be found in EpiGraphDB's documentation, specifically [here for nodes](https://docs.epigraphdb.org/graph-database/meta-nodes/) and [here for relationships](https://docs.epigraphdb.org/graph-database/meta-relationships/).
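The three-part structure can be made explicit by assembling a query from its clauses. The sketch below rebuilds the gene-to-protein query printed earlier. `build_cypher` is a hypothetical helper for illustration only; it performs plain string formatting with no escaping, so it is suited to exploratory use rather than to handling untrusted input.

```python
def build_cypher(match, where, returns):
    """Assemble a basic Cypher query from its MATCH, WHERE and RETURN parts."""
    return f"MATCH {match} WHERE {where} RETURN {returns}"


query = build_cypher(
    # (node)-[relationship]-(node), each written as variable:Type
    match="(gene:Gene)-[gp:GENE_TO_PROTEIN]-(protein:Protein)",
    # Condition expressed through a property of the 'gene' variable
    where="gene.name IN ['TP53']",
    # Properties to return for each matched node
    returns="gene {.ensembl_id, .name}, protein {.uniprot_id}",
)
print(query)
```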

Now let's write and send our own basic query to get traits with high genetic correlation to body mass index:

In [14]:
# 5.2 Writing custom Cypher queries

# Define the target subgraph
cypher_query = 'MATCH (trait1:Gwas)-[corr:BN_GEN_COR]-(trait2:Gwas)'

# Add conditions to the query
cypher_query += ' WHERE trait1.trait = "Body mass index (BMI)"'
cypher_query += ' AND corr.rg > 0.9'

# Add which data we want returned
cypher_query += ' RETURN trait1, trait2, corr {.rg, .p}'

# Put our query into the correct format for a POST request
params = json.dumps({
    'query': cypher_query
})

# Define the target endpoint and send the request
endpoint = '/cypher'
response_object = requests.post(API_URL + endpoint, data=params)
response_object.raise_for_status()

# Display the returned data
results = response_object.json()['results']
results_df = pd.json_normalize(results)

results_df.head()

In our Cypher query, we matched a subgraph of the database comprising nodes that represent biomedical traits, with edges between them representing their genetic correlation. The subgraph was then filtered to keep only node-edge-node triples where the first node had a `.trait` property of "Body mass index (BMI)" and where the edge between the nodes had an `.rg` (genetic correlation score) value greater than 0.9. We then asked Neo4j to return both trait nodes, along with the score and p-value of the correlation between them, for every triple that satisfied our conditions. Finally, we converted the returned results to a dataframe for ease of viewing.

For more detailed information on Cypher queries, please refer to the [official documentation](https://neo4j.com/developer/cypher/).