{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started with EpiGraphDB in Python\n",
"\n",
"This notebook is provided as a brief introductory guide to working with the EpiGraphDB platform through Python. Here we will demonstrate a few basic operations that can be carried out using the platform, but for more advanced methods please refer to the [API endpoint documentation](http://docs.epigraphdb.org/api/api-endpoints/).\n",
"\n",
"A Python wrapper for EpiGraphDB's API is currently in the works, but for now we will be querying it directly using the `requests` library- knowledge of this package is advantageous but not essential."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import requests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we will ping the API to check our connection:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"If this line gets printed, ping was sucessful.\n"
]
}
],
"source": [
"# Store our API URL as a string for future use\n",
"API_URL = \"https://api.epigraphdb.org\"\n",
"\n",
"# Here we use the .get() method to send a GET request to the /ping endpoint of the API\n",
"endpoint = '/ping'\n",
"response_object = requests.get(API_URL + endpoint) \n",
"\n",
"# Check that the ping was sucessful\n",
"response_object.raise_for_status() \n",
"print(\"If this line gets printed, ping was sucessful.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"## 1. Using EpiGraphDB to obtain biological mappings\n",
"\n",
"In this first section, we will take an arbitrary list of genes and query the EpiGraph API to find the proteins that they map to. We will be using the `POST` HTTP method which requires its parameters to be passed in JSON format, a conversion that is easy to do using the `json` library. To find the correct names of the parameters that we are about to set, we can navigate to the [EpiGraphDB API documentation](http://docs.epigraphdb.org/api/api-endpoints/) and find the endpoint of interest. From there we simply read off the parameters that we want to pass, and can take a look at the example request as a reference point if needed."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" gene.name | \n",
" gene.ensembl_id | \n",
" protein.uniprot_id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TP53 | \n",
" ENSG00000141510 | \n",
" P04637 | \n",
"
\n",
" \n",
" 1 | \n",
" BRCA1 | \n",
" ENSG00000012048 | \n",
" P38398 | \n",
"
\n",
" \n",
" 2 | \n",
" TNF | \n",
" ENSG00000232810 | \n",
" P01375 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" gene.name gene.ensembl_id protein.uniprot_id\n",
"0 TP53 ENSG00000141510 P04637\n",
"1 BRCA1 ENSG00000012048 P38398\n",
"2 TNF ENSG00000232810 P01375"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 1.1 Mapping genes to proteins\n",
"\n",
"# Set parameters and convert to JSON format\n",
"import json\n",
"params = {\n",
" \"gene_name_list\": [\n",
" \"TP53\",\n",
" \"BRCA1\", \n",
" \"TNF\"\n",
" ]\n",
"}\n",
"json_params = json.dumps(params)\n",
"\n",
"# Define which endpoint of the API we would like to connect with\n",
"endpoint = '/mappings/gene-to-protein'\n",
"\n",
"# Send the POST request\n",
"response_object = requests.post(API_URL + endpoint, data=json_params)\n",
"\n",
"# Check for successful request\n",
"response_object.raise_for_status()\n",
"\n",
"# Store results in a pandas dataframe\n",
"import pandas as pd\n",
"results = response_object.json()['results']\n",
"gene_protein_df = pd.json_normalize(results)\n",
"\n",
"gene_protein_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the above cell, we queried EpiGraphDB for the proteins that have been mapped to the genes *TP53*, *BRCA1*, and *TNF*. Our query went through successfully and we received an associated protein for each. The columns in our output dataframe take the general form `entity.property` and this will remain consistent throughout this notebook. \n",
"\n",
"Specific descriptions for the properties of each entity can be found in EpiGraphDB's [data dictionary](https://docs.epigraphdb.org/graph-database/meta-nodes/). Simply click on the relevant entity in the table of contents on the right hand side (or scroll down to the relevant section), then locate the property of interest."
]
},
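{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `entity.property` column names come from flattening the nested records in the API response. As a rough illustration, the pure-Python sketch below mimics what `pd.json_normalize` does with its default `.` separator. Note that `flatten_record` is a hypothetical helper written for this example, and the sample record is copied by hand from the output above (in the shape we assume the API returns) rather than fetched live.\n",
"\n",
"```python\n",
"def flatten_record(record, parent_key='', sep='.'):\n",
"    # Flatten nested dictionaries into a single dictionary with dotted keys\n",
"    flat = {}\n",
"    for key, value in record.items():\n",
"        full_key = f'{parent_key}{sep}{key}' if parent_key else key\n",
"        if isinstance(value, dict):\n",
"            # Recurse into nested entities such as 'gene' or 'protein'\n",
"            flat.update(flatten_record(value, full_key, sep))\n",
"        else:\n",
"            flat[full_key] = value\n",
"    return flat\n",
"\n",
"\n",
"# Hand-copied sample in the assumed shape of one API result\n",
"sample_record = {\n",
"    'gene': {'name': 'TP53', 'ensembl_id': 'ENSG00000141510'},\n",
"    'protein': {'uniprot_id': 'P04637'},\n",
"}\n",
"\n",
"flat_record = flatten_record(sample_record)\n",
"print(sorted(flat_record))\n",
"```\n",
"\n",
"Each nested key is joined to its parent entity with a dot, which is why the same naming pattern appears in every dataframe in this notebook."
]
},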
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" uniprot_id | \n",
" pathway_count | \n",
" pathway_reactome_id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" P04637 | \n",
" 5 | \n",
" [R-HSA-6785807, R-HSA-390471, R-HSA-5689896, R... | \n",
"
\n",
" \n",
" 1 | \n",
" P38398 | \n",
" 6 | \n",
" [R-HSA-6796648, R-HSA-1221632, R-HSA-8953750, ... | \n",
"
\n",
" \n",
" 2 | \n",
" P01375 | \n",
" 3 | \n",
" [R-HSA-6785807, R-HSA-6783783, R-HSA-5357905] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" uniprot_id pathway_count pathway_reactome_id\n",
"0 P04637 5 [R-HSA-6785807, R-HSA-390471, R-HSA-5689896, R...\n",
"1 P38398 6 [R-HSA-6796648, R-HSA-1221632, R-HSA-8953750, ...\n",
"2 P01375 3 [R-HSA-6785807, R-HSA-6783783, R-HSA-5357905]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 1.2 Proteins to pathways\n",
"\n",
"# As above, this is another POST request, so we need our data in JSON format\n",
"json_params = json.dumps({\n",
" \"uniprot_id_list\": list(gene_protein_df['protein.uniprot_id'].values)\n",
"})\n",
"\n",
"# Send the request\n",
"endpoint = '/protein/in-pathway'\n",
"response_object = requests.post(API_URL + endpoint, data=json_params)\n",
"\n",
"# Check for successful request\n",
"response_object.raise_for_status()\n",
"\n",
"# Store results\n",
"results = response_object.json()['results']\n",
"protein_pathway_df = pd.json_normalize(results)\n",
"\n",
"protein_pathway_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above, we took the proteins that had been mapped to our genes of interest and queried the platform for their associated pathway data. The API found multiple such pathways for each gene and has returned the respective reactome IDs to us as lists.\n",
"\n",
"\n",
"It is worth noting here that so far we have only been accessing the `'results'` key in the nested dictionairy returned by the `.json()` method of our response object. The other available key is `'metadata'` (see the output below) which provides us with information about the request itself, including the specific Cypher query that the platform ran to get these results. If you would like to know more about the use of Cypher in these requests, there is a section dedicated to this at the end of this notebook."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'empty_results': False,\n",
" 'query': 'MATCH p=(protein:Protein)-[r:PROTEIN_IN_PATHWAY]-(pathway:Pathway) '\n",
" \"WHERE protein.uniprot_id IN ['P04637', 'P38398', 'P01375'] RETURN \"\n",
" 'protein.uniprot_id AS uniprot_id, count(p) AS pathway_count, '\n",
" 'collect(pathway.reactome_id) AS pathway_reactome_id',\n",
" 'total_seconds': 0.005797}\n"
]
}
],
"source": [
"from pprint import pprint\n",
"metadata = response_object.json()['metadata']\n",
"\n",
"pprint(metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"## 2. Epidemiological relationship analysis\n",
"\n",
"In the cell below, we will query EpiGraphDB to get metadata relating to GWAS studies of a target trait- body mass index. Following that, queries will be performed to get pre-computed Mendelian Randomisation (MR) results involving the same trait. \n",
"\n",
"Here we will be using a different HTTP method than before- the `GET` method, which is in fact easier to use in Python because the parameters can be passed directly as a dictionary. To learn more about the differences between `GET` and `POST`, please see [this guide](https://www.w3schools.com/tags/ref_httpmethods.asp). "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" node.note | \n",
" node.access | \n",
" node.year | \n",
" node.mr | \n",
" node.author | \n",
" node.consortium | \n",
" node.sex | \n",
" node.priority | \n",
" node.pmid | \n",
" node.population | \n",
" node.unit | \n",
" node.sample_size | \n",
" node.nsnp | \n",
" node.trait | \n",
" node.id | \n",
" node.subcategory | \n",
" node.category | \n",
" node.sd | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NA | \n",
" public | \n",
" 2018 | \n",
" 1 | \n",
" Hoffmann TJ | \n",
" NA | \n",
" NA | \n",
" 0 | \n",
" 30108127 | \n",
" European | \n",
" NA | \n",
" 315347 | \n",
" 27854527 | \n",
" Body mass index | \n",
" ebi-a-GCST006368 | \n",
" NA | \n",
" NA | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" public | \n",
" 2015 | \n",
" 1 | \n",
" Locke AE | \n",
" NA | \n",
" Males and Females | \n",
" 1 | \n",
" 25673413 | \n",
" Mixed | \n",
" NA | \n",
" 339224 | \n",
" 2555511 | \n",
" Body mass index | \n",
" ieu-a-2 | \n",
" Anthropometric | \n",
" Risk factor | \n",
" 4.77 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" public | \n",
" 2015 | \n",
" 1 | \n",
" Locke AE | \n",
" NA | \n",
" Males | \n",
" 2 | \n",
" 25673413 | \n",
" European | \n",
" NA | \n",
" 152893 | \n",
" 2477659 | \n",
" Body mass index | \n",
" ieu-a-785 | \n",
" Anthropometric | \n",
" Risk factor | \n",
" 4.77 | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" public | \n",
" 2015 | \n",
" 1 | \n",
" Locke AE | \n",
" NA | \n",
" Males and Females | \n",
" 3 | \n",
" 25673413 | \n",
" European | \n",
" NA | \n",
" 322154 | \n",
" 2554668 | \n",
" Body mass index | \n",
" ieu-a-835 | \n",
" Anthropometric | \n",
" Risk factor | \n",
" 4.77 | \n",
"
\n",
" \n",
" 4 | \n",
" NA | \n",
" public | \n",
" 2017 | \n",
" 1 | \n",
" Akiyama M | \n",
" NA | \n",
" NA | \n",
" 0 | \n",
" 28892062 | \n",
" East Asian | \n",
" NA | \n",
" 158284 | \n",
" 5952516 | \n",
" Body mass index | \n",
" ebi-a-GCST004904 | \n",
" NA | \n",
" NA | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" node.note node.access node.year node.mr node.author node.consortium \\\n",
"0 NA public 2018 1 Hoffmann TJ NA \n",
"1 NaN public 2015 1 Locke AE NA \n",
"2 NaN public 2015 1 Locke AE NA \n",
"3 NaN public 2015 1 Locke AE NA \n",
"4 NA public 2017 1 Akiyama M NA \n",
"\n",
" node.sex node.priority node.pmid node.population node.unit \\\n",
"0 NA 0 30108127 European NA \n",
"1 Males and Females 1 25673413 Mixed NA \n",
"2 Males 2 25673413 European NA \n",
"3 Males and Females 3 25673413 European NA \n",
"4 NA 0 28892062 East Asian NA \n",
"\n",
" node.sample_size node.nsnp node.trait node.id \\\n",
"0 315347 27854527 Body mass index ebi-a-GCST006368 \n",
"1 339224 2555511 Body mass index ieu-a-2 \n",
"2 152893 2477659 Body mass index ieu-a-785 \n",
"3 322154 2554668 Body mass index ieu-a-835 \n",
"4 158284 5952516 Body mass index ebi-a-GCST004904 \n",
"\n",
" node.subcategory node.category node.sd \n",
"0 NA NA NaN \n",
"1 Anthropometric Risk factor 4.77 \n",
"2 Anthropometric Risk factor 4.77 \n",
"3 Anthropometric Risk factor 4.77 \n",
"4 NA NA NaN "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 2.1 Getting GWAS studies from EpiGraphDB\n",
"\n",
"# Create a dictionary for the parameters to be passed\n",
"params = {\n",
" 'name':'Body mass index'\n",
"}\n",
"\n",
"# Send the request\n",
"endpoint = '/meta/nodes/Gwas/search'\n",
"response_object = requests.get(API_URL + endpoint, params=params)\n",
"response_object.raise_for_status()\n",
"\n",
"# Store the results of the query and display\n",
"result = response_object.json()['results']\n",
"gwas_df = pd.json_normalize(result)\n",
"\n",
"gwas_df.head()"
]
},
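{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `GET`, the parameter dictionary ends up URL-encoded in the query string, whereas the `POST` requests in section 1 carried their parameters as a JSON body. The standard-library sketch below shows roughly what each looks like on the wire. The `requests` library handles this for us, so the code is illustrative only and sends nothing over the network.\n",
"\n",
"```python\n",
"import json\n",
"from urllib.parse import urlencode\n",
"\n",
"API_URL = 'https://api.epigraphdb.org'\n",
"\n",
"# GET: parameters are appended to the URL as a query string\n",
"get_params = {'name': 'Body mass index'}\n",
"get_url = API_URL + '/meta/nodes/Gwas/search' + '?' + urlencode(get_params)\n",
"print(get_url)\n",
"\n",
"# POST: parameters are serialised to JSON and sent in the request body\n",
"post_params = {'gene_name_list': ['TP53', 'BRCA1', 'TNF']}\n",
"post_body = json.dumps(post_params)\n",
"print(post_body)\n",
"```\n",
"\n",
"Because a query string is flat text, `GET` suits short, simple parameters, while a JSON body copes more naturally with lists such as `gene_name_list`."
]
},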
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" exposure.id | \n",
" exposure.trait | \n",
" outcome.id | \n",
" outcome.trait | \n",
" mr.b | \n",
" mr.se | \n",
" mr.pval | \n",
" mr.method | \n",
" mr.selection | \n",
" mr.moescore | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" ieu-a-2 | \n",
" Body mass index | \n",
" ukb-a-74 | \n",
" Non-cancer illness code self-reported: diabetes | \n",
" 0.034559 | \n",
" 0.002418 | \n",
" 0.0 | \n",
" FE IVW | \n",
" DF | \n",
" 0.93 | \n",
"
\n",
" \n",
" 1 | \n",
" ieu-a-2 | \n",
" Body mass index | \n",
" ukb-a-388 | \n",
" Hip circumference | \n",
" 0.724105 | \n",
" 0.026588 | \n",
" 0.0 | \n",
" Simple median | \n",
" Tophits | \n",
" 0.95 | \n",
"
\n",
" \n",
" 2 | \n",
" ieu-a-2 | \n",
" Body mass index | \n",
" ukb-a-382 | \n",
" Waist circumference | \n",
" 0.656440 | \n",
" 0.024496 | \n",
" 0.0 | \n",
" Simple median | \n",
" Tophits | \n",
" 0.94 | \n",
"
\n",
" \n",
" 3 | \n",
" ieu-a-2 | \n",
" Body mass index | \n",
" ukb-a-35 | \n",
" Comparative height size at age 10 | \n",
" 0.136684 | \n",
" 0.007909 | \n",
" 0.0 | \n",
" FE IVW | \n",
" Tophits | \n",
" 0.94 | \n",
"
\n",
" \n",
" 4 | \n",
" ieu-a-2 | \n",
" Body mass index | \n",
" ukb-a-34 | \n",
" Comparative body size at age 10 | \n",
" 0.365580 | \n",
" 0.023556 | \n",
" 0.0 | \n",
" Simple median | \n",
" HF | \n",
" 0.87 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" exposure.id exposure.trait outcome.id \\\n",
"0 ieu-a-2 Body mass index ukb-a-74 \n",
"1 ieu-a-2 Body mass index ukb-a-388 \n",
"2 ieu-a-2 Body mass index ukb-a-382 \n",
"3 ieu-a-2 Body mass index ukb-a-35 \n",
"4 ieu-a-2 Body mass index ukb-a-34 \n",
"\n",
" outcome.trait mr.b mr.se \\\n",
"0 Non-cancer illness code self-reported: diabetes 0.034559 0.002418 \n",
"1 Hip circumference 0.724105 0.026588 \n",
"2 Waist circumference 0.656440 0.024496 \n",
"3 Comparative height size at age 10 0.136684 0.007909 \n",
"4 Comparative body size at age 10 0.365580 0.023556 \n",
"\n",
" mr.pval mr.method mr.selection mr.moescore \n",
"0 0.0 FE IVW DF 0.93 \n",
"1 0.0 Simple median Tophits 0.95 \n",
"2 0.0 Simple median Tophits 0.94 \n",
"3 0.0 FE IVW Tophits 0.94 \n",
"4 0.0 Simple median HF 0.87 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 2.2 Getting MR results for a trait\n",
"\n",
"# Set parameters\n",
"params = {'exposure_trait': 'Body mass index',\n",
" 'pval_threshold': 1e-10}\n",
"\n",
"# Send request\n",
"endpoint = '/mr'\n",
"response_object = requests.get(API_URL + endpoint, params=params)\n",
"response_object.raise_for_status()\n",
"\n",
"# Store and display results\n",
"result = response_object.json()['results']\n",
"BMI_MR_df = pd.json_normalize(result) \n",
"\n",
"BMI_MR_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataframe above displays the results of our query. We requested all traits for which an MR analysis using body mass index as the exposure variable returned a causal estimate with a p-value lower than 1e-10. Information regarding the specific MR parameters, as well as the exposure and outcome variables, has been displayed in the table for all traits that matched our search conditions.\n",
"\n",
"In the parameters we set in 2.2, another viable parameter name is `'outcome_trait'` which takes the same type of values as `'exposure_trait'`. Either one or both of these parameters can be passed during an MR query, which allows users to refine which results are returned to them depending on their own analytical preferences."
]
},
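{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since either trait parameter is optional, one convenient pattern is to build the parameter dictionary from whichever values are supplied. The sketch below assumes only the three parameter names used in this notebook (`exposure_trait`, `outcome_trait` and `pval_threshold`); `build_mr_params` is a hypothetical helper written for illustration and is not part of EpiGraphDB.\n",
"\n",
"```python\n",
"def build_mr_params(exposure_trait=None, outcome_trait=None, pval_threshold=1e-10):\n",
"    # Require at least one trait, mirroring the description above\n",
"    if exposure_trait is None and outcome_trait is None:\n",
"        raise ValueError('Provide exposure_trait, outcome_trait, or both')\n",
"    candidates = {\n",
"        'exposure_trait': exposure_trait,\n",
"        'outcome_trait': outcome_trait,\n",
"        'pval_threshold': pval_threshold,\n",
"    }\n",
"    # Drop anything left unset so that it is not sent to the API\n",
"    return {key: value for key, value in candidates.items() if value is not None}\n",
"\n",
"\n",
"exposure_only = build_mr_params(exposure_trait='Body mass index')\n",
"both_traits = build_mr_params(\n",
"    exposure_trait='Body mass index',\n",
"    outcome_trait='Waist circumference',\n",
")\n",
"print(exposure_only)\n",
"print(both_traits)\n",
"```\n",
"\n",
"Either dictionary could then be passed as `params` to `requests.get` with the `/mr` endpoint, in the same way as in 2.2."
]
},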
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"## 3. Looking for literature evidence\n",
"\n",
"Accessing information in the literature is a ubiquitous task in research, be it for novel hypothesis generation or as part of evidence triangulation. EpiGraphDB facilitates fast processing of this information by allowing access to a host of literature-mined relationships that have been structured into semantic triples. These take the general form *(subject, predicate, object)* and have been generated using contemporary natural language processing techniques applied to a massive amount of published biomedical research papers. In the following section we will query the API for the literature relationship between a given gene and an outcome trait."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pubmed_id | \n",
" gene.name | \n",
" st.predicate | \n",
" st.object_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" [17484863, 21155887] | \n",
" IL23R | \n",
" NEG_ASSOCIATED_WITH | \n",
" Inflammatory Bowel Diseases | \n",
"
\n",
" \n",
" 1 | \n",
" [27852544] | \n",
" IL23R | \n",
" AFFECTS | \n",
" Inflammatory Bowel Diseases | \n",
"
\n",
" \n",
" 2 | \n",
" [17484863, 19575361, 19496308, 18383521, 18341... | \n",
" IL23R | \n",
" ASSOCIATED_WITH | \n",
" Inflammatory Bowel Diseases | \n",
"
\n",
" \n",
" 3 | \n",
" [23131344] | \n",
" IL23R | \n",
" PREDISPOSES | \n",
" Inflammatory Bowel Diseases | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pubmed_id gene.name \\\n",
"0 [17484863, 21155887] IL23R \n",
"1 [27852544] IL23R \n",
"2 [17484863, 19575361, 19496308, 18383521, 18341... IL23R \n",
"3 [23131344] IL23R \n",
"\n",
" st.predicate st.object_name \n",
"0 NEG_ASSOCIATED_WITH Inflammatory Bowel Diseases \n",
"1 AFFECTS Inflammatory Bowel Diseases \n",
"2 ASSOCIATED_WITH Inflammatory Bowel Diseases \n",
"3 PREDISPOSES Inflammatory Bowel Diseases "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Establish parameters\n",
"params = {\n",
" 'gene_name': \"IL23R\",\n",
" 'object_name': \"Inflammatory bowel disease\"\n",
"}\n",
"\n",
"# Send the request\n",
"endpoint = \"/literature/gene\"\n",
"response_object = requests.get(API_URL + endpoint, params=params)\n",
"response_object.raise_for_status()\n",
"\n",
"# Store the results of the query and display\n",
"result = response_object.json()['results']\n",
"lit_df = pd.json_normalize(result) \n",
"\n",
"lit_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataframe outputted above shows the results of our query- four unique predicates were found between the gene *IL23R* and the trait *Inflammatory bowel disease* and are displayed in the `st.predicate` column. Our leftmost column contains the pubmed IDs of the papers from which this triple was derived. These IDs allow us to access the respective papers by navigating to `https://pubmed.ncbi.nlm.nih.gov/*insert_pubmed_id_here*`. In this particular case it seems that *ASSOCIATED_WITH* is the most common predicate linking our gene to the trait, but we can't see exactly how many papers there are due to how pandas displays lists. Let's add a paper count to the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pubmed_id | \n",
" gene.name | \n",
" st.predicate | \n",
" st.object_name | \n",
" publication_count | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" [17484863, 21155887] | \n",
" IL23R | \n",
" NEG_ASSOCIATED_WITH | \n",
" Inflammatory Bowel Diseases | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" [27852544] | \n",
" IL23R | \n",
" AFFECTS | \n",
" Inflammatory Bowel Diseases | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" [17484863, 19575361, 19496308, 18383521, 18341... | \n",
" IL23R | \n",
" ASSOCIATED_WITH | \n",
" Inflammatory Bowel Diseases | \n",
" 21 | \n",
"
\n",
" \n",
" 3 | \n",
" [23131344] | \n",
" IL23R | \n",
" PREDISPOSES | \n",
" Inflammatory Bowel Diseases | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pubmed_id gene.name \\\n",
"0 [17484863, 21155887] IL23R \n",
"1 [27852544] IL23R \n",
"2 [17484863, 19575361, 19496308, 18383521, 18341... IL23R \n",
"3 [23131344] IL23R \n",
"\n",
" st.predicate st.object_name publication_count \n",
"0 NEG_ASSOCIATED_WITH Inflammatory Bowel Diseases 2 \n",
"1 AFFECTS Inflammatory Bowel Diseases 1 \n",
"2 ASSOCIATED_WITH Inflammatory Bowel Diseases 21 \n",
"3 PREDISPOSES Inflammatory Bowel Diseases 1 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts = [len(papers_list) for papers_list in lit_df['pubmed_id']]\n",
"lit_df['publication_count'] = counts\n",
"\n",
"lit_df"
]
},
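{
"cell_type": "markdown",
"metadata": {},
"source": [
"To follow up on individual papers, the IDs can be turned into links using the PubMed URL pattern mentioned earlier. The following is a minimal sketch that uses two IDs copied by hand from the first row of the table rather than the live dataframe; `pubmed_urls` is a hypothetical helper name.\n",
"\n",
"```python\n",
"PUBMED_BASE_URL = 'https://pubmed.ncbi.nlm.nih.gov/'\n",
"\n",
"\n",
"def pubmed_urls(pubmed_ids):\n",
"    # Build one link per PubMed ID\n",
"    return [f'{PUBMED_BASE_URL}{pubmed_id}' for pubmed_id in pubmed_ids]\n",
"\n",
"\n",
"example_ids = ['17484863', '21155887']\n",
"example_urls = pubmed_urls(example_ids)\n",
"for url in example_urls:\n",
"    print(url)\n",
"```\n",
"\n",
"With the dataframe from above, something like `lit_df['pubmed_id'].apply(pubmed_urls)` should produce the links for every row."
]
},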
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-----\n",
"## 4. EpiGraphDB node search\n",
"\n",
"EpiGraphDB stores data as nodes (entities) and edges (relationships) of a wide range of types. The `/meta` endpoints of the API offer us information about the structure of the graph itself- for example, the available classes of nodes can be listed through the `/meta/nodes/list` endpoint. Let’s do that now:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Gwas',\n",
" 'Disease',\n",
" 'Drug',\n",
" 'Efo',\n",
" 'Event',\n",
" 'Gene',\n",
" 'Tissue',\n",
" 'Literature',\n",
" 'Pathway',\n",
" 'Protein',\n",
" 'SemmedTerm',\n",
" 'Variant']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 4.1 Getting a list of available meta-nodes\n",
"\n",
"# Send the request\n",
"endpoint = \"/meta/nodes/list\"\n",
"response_object = requests.get(API_URL + endpoint)\n",
"response_object.raise_for_status()\n",
"\n",
"# Store the results of the query and display\n",
"result = response_object.json()\n",
"\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This list above corresponds to EpiGraphDB's meta nodes, whose documentation can be found [here](https://docs.epigraphdb.org/graph-database/meta-nodes/) along with their available properties.\n",
"\n",
"In the following, we will demonstrate how we can search by name for a node of interest, using the endpoint `/meta/nodes/{meta_node}/search`, where viable values for `{meta_node}` are those terms listed above.\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" node.trait | \n",
" node.id | \n",
" node.sample_size | \n",
" node.year | \n",
" node.author | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Breast cancer | \n",
" ebi-a-GCST007236 | \n",
" 89677 | \n",
" 2015 | \n",
" Michailidou K | \n",
"
\n",
" \n",
" 1 | \n",
" Breast cancer | \n",
" ebi-a-GCST004988 | \n",
" 139274 | \n",
" 2017 | \n",
" Michailidou K | \n",
"
\n",
" \n",
" 2 | \n",
" Breast cancer (Combined Oncoarray; iCOGS; GWAS... | \n",
" ieu-a-1126 | \n",
" 228951 | \n",
" 2017 | \n",
" Michailidou K | \n",
"
\n",
" \n",
" 3 | \n",
" Breast cancer (GWAS) | \n",
" ieu-a-1131 | \n",
" 32498 | \n",
" 2017 | \n",
" Michailidou K | \n",
"
\n",
" \n",
" 4 | \n",
" Breast cancer (GWAS) | \n",
" ieu-a-1168 | \n",
" 33832 | \n",
" 2015 | \n",
" Michailidou K | \n",
"
\n",
" \n",
" 5 | \n",
" Breast cancer (Oncoarray) | \n",
" ieu-a-1129 | \n",
" 106776 | \n",
" 2017 | \n",
" Michailidou K | \n",
"
\n",
" \n",
" 6 | \n",
" Breast cancer (Survival) | \n",
" ieu-a-1165 | \n",
" 37954 | \n",
" 2015 | \n",
" Guo Q | \n",
"
\n",
" \n",
" 7 | \n",
" Breast cancer (iCOGS) | \n",
" ieu-a-1162 | \n",
" 89677 | \n",
" 2015 | \n",
" Michailidou K | \n",
"
\n",
" \n",
" 8 | \n",
" Breast cancer (iCOGS) | \n",
" ieu-a-1130 | \n",
" 89677 | \n",
" 2017 | \n",
" Michailidou K | \n",
"
\n",
" \n",
" 9 | \n",
" Breast cancer anti-estrogen resistance protein 3 | \n",
" prot-a-234 | \n",
" 3301 | \n",
" 2018 | \n",
" Sun BB | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" node.trait node.id \\\n",
"0 Breast cancer ebi-a-GCST007236 \n",
"1 Breast cancer ebi-a-GCST004988 \n",
"2 Breast cancer (Combined Oncoarray; iCOGS; GWAS... ieu-a-1126 \n",
"3 Breast cancer (GWAS) ieu-a-1131 \n",
"4 Breast cancer (GWAS) ieu-a-1168 \n",
"5 Breast cancer (Oncoarray) ieu-a-1129 \n",
"6 Breast cancer (Survival) ieu-a-1165 \n",
"7 Breast cancer (iCOGS) ieu-a-1162 \n",
"8 Breast cancer (iCOGS) ieu-a-1130 \n",
"9 Breast cancer anti-estrogen resistance protein 3 prot-a-234 \n",
"\n",
" node.sample_size node.year node.author \n",
"0 89677 2015 Michailidou K \n",
"1 139274 2017 Michailidou K \n",
"2 228951 2017 Michailidou K \n",
"3 32498 2017 Michailidou K \n",
"4 33832 2015 Michailidou K \n",
"5 106776 2017 Michailidou K \n",
"6 37954 2015 Guo Q \n",
"7 89677 2015 Michailidou K \n",
"8 89677 2017 Michailidou K \n",
"9 3301 2018 Sun BB "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 4.2 Searching for specific entities by name\n",
"\n",
"# Set params \n",
"params = {\n",
" 'name': 'breast cancer'\n",
"}\n",
"\n",
"# Make request\n",
"meta_node = 'Gwas'\n",
"endpoint = f\"/meta/nodes/{meta_node}/search\"\n",
"response_object = requests.get(API_URL + endpoint, params=params)\n",
"response_object.raise_for_status()\n",
"\n",
"# Convert to pandas\n",
"results = pd.json_normalize(response_object.json()['results'])\n",
"target_node_id = results['node.id'][3] # Store one ID for use in the next cell\n",
"\n",
"results[['node.trait', 'node.id', 'node.sample_size', 'node.year', 'node.author']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above we used the `name` parameter of the endpoint to search for any GWAS nodes that fuzzily matched our specified string. Several did, and some of their basic node properties are displayed above. Fuzzy matching is useful because you don't need to know the exact name of the entity or its ID in order to look it up. \n",
"\n",
"On the other hand, once you have identified your entity of interest, it is often sensible to move forward using the node's ID for the sake of unambiguity. Fortunately we can also search for traits using their ID, as demonstrated below."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" node.ncase | \n",
" node.access | \n",
" node.year | \n",
" node.mr | \n",
" node.author | \n",
" node.consortium | \n",
" node.sex | \n",
" node.priority | \n",
" node.pmid | \n",
" node.population | \n",
" node.unit | \n",
" node.sample_size | \n",
" node.nsnp | \n",
" node.ncontrol | \n",
" node.trait | \n",
" node.id | \n",
" node.subcategory | \n",
" node.category | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 14910 | \n",
" public | \n",
" 2017 | \n",
" 1 | \n",
" Michailidou K | \n",
" NA | \n",
" Females | \n",
" 1 | \n",
" 29059683 | \n",
" European | \n",
" NA | \n",
" 32498 | \n",
" 10680257 | \n",
" 17588 | \n",
" Breast cancer (GWAS) | \n",
" ieu-a-1131 | \n",
" Cancer | \n",
" Disease | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" node.ncase node.access node.year node.mr node.author node.consortium \\\n",
"0 14910 public 2017 1 Michailidou K NA \n",
"\n",
" node.sex node.priority node.pmid node.population node.unit node.sample_size \\\n",
"0 Females 1 29059683 European NA 32498 \n",
"\n",
" node.nsnp node.ncontrol node.trait node.id node.subcategory \\\n",
"0 10680257 17588 Breast cancer (GWAS) ieu-a-1131 Cancer \n",
"\n",
" node.category \n",
"0 Disease "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 4.3 Searching for a node by ID\n",
"\n",
"# Set params\n",
"params = {\n",
" 'id': target_node_id # From previous cell\n",
"}\n",
"\n",
"# Make request\n",
"meta_node = 'Gwas'\n",
"endpoint = f\"/meta/nodes/{meta_node}/search\"\n",
"response_object = requests.get(API_URL + endpoint, params=params)\n",
"response_object.raise_for_status()\n",
"\n",
"# Convert to pandas\n",
"results = pd.json_normalize(response_object.json()['results'])\n",
"\n",
"results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"## Advanced examples- Cypher\n",
"\n",
"Until now, to get information from the platform we have been simply creating a dictionary or JSON object containing our parameters and then sending it to the correct endpoint of the API using the `requests` library. This is fine practice and the API has been designed specifically to allow this method of use, as we have (inexhaustively) demonstrated above. It works because the API automatically converts the HTTP requests that it receives into a Cypher query, which it then passes to the Neo4j database on which EpiGraphDB is built. The database passes back the result of the query, which is then returned to us in Python as a response object. Each response object contains metadata that includes the exact Cypher query that was called on the database, as shown in the cell below."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MATCH (gene:Gene)-[gp:GENE_TO_PROTEIN]-(protein:Protein) WHERE gene.name IN ['TP53'] RETURN gene {.ensembl_id, .name}, protein {.uniprot_id}\n"
]
}
],
"source": [
"# 4.1 Cypher\n",
"\n",
"params = {\n",
" \"gene_name_list\": [\n",
" \"TP53\"\n",
" ]\n",
"}\n",
"json_params = json.dumps(params)\n",
"endpoint = '/mappings/gene-to-protein'\n",
"response_object = requests.post(API_URL + endpoint, data=json_params)\n",
"response_object.raise_for_status()\n",
"\n",
"# Extract and print the Cypher query\n",
"cypher_query = response_object.json()['metadata']['query']\n",
"print(cypher_query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The text printed above is the exact Cypher query that was run in section 1.1, behind the scenes. The basic structure of these queries is as follows:\n",
"\n",
" MATCH subgraph\n",
"\n",
" WHERE condition\n",
"\n",
" RETURN data\n",
"\n",
"Note that the subgraph should take this general form: *(node)-[relationship]-(node)*, but for both nodes and relationships we write them as `my_variable_name:Meta_node` so that we can access their properties through the variable name we assigned them (my_variable_name), and use those properties to define our conditions and what data we want returned. Information on the available properties for each class of entity can be found in EpiGraphDB's documentation, specifically [here for nodes](https://docs.epigraphdb.org/graph-database/meta-nodes/) and [here for relationships](https://docs.epigraphdb.org/graph-database/meta-relationships/).\n",
"\n",
"Now let's write and send our own basic query to get traits with high genetic correlation to body mass index:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"ename": "SyntaxError",
"evalue": "EOL while scanning string literal (, line 7)",
"output_type": "error",
"traceback": [
"\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m7\u001b[0m\n\u001b[0;31m cypher_query += ' WHERE trait1.trait = \"Body mass index (BMI)\"\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m EOL while scanning string literal\n"
]
}
],
"source": [
"# 4.2 Writing custom Cypher queries\n",
"\n",
"# Define the target subgraph\n",
"cypher_query = 'MATCH (trait1:Gwas)-[corr:BN_GEN_COR]-(trait2:Gwas)'\n",
"\n",
"# Add conditions to the query\n",
"cypher_query += ' WHERE trait1.trait = \"Body mass index (BMI)\" \n",
"cypher_query += ' AND corr.rg > 0.9'\n",
"\n",
"# Add which data we want returned\n",
"cypher_query += ' RETURN trait1, trait2, corr {.rg, .p}'\n",
"\n",
"# Put our query into the correct format for a POST request\n",
"params = json.dumps({\n",
" 'query': cypher_query\n",
"})\n",
"\n",
"# Define the target endpoint and send the request\n",
"endpoint = '/cypher'\n",
"response_object = requests.post(API_URL + endpoint, data=params)\n",
"response_object.raise_for_status()\n",
"\n",
"# Display the returned data\n",
"results = response_object.json()['results']\n",
"results_df = pd.json_normalize(results)\n",
"\n",
"results_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our Cypher query, we grabbed a subgraph from the database that comprised nodes representing biomedical traits, with edges between them representing their genetic correlation. The subgraph was then filtered to select any node-edge-node triples where the first node had the `.trait` property of \"Body mass index (BMI)\", and where the edge between the nodes had a `.rg` (genetic correlation score) value greater than 0.9. We then asked Neo4j to return us the names of the two traits, as well as the score and p-value of the correlation between the two, for all triples not filtered out by our conditions. Finally, we converted the returned dictionary to a dataframe for ease of viewing.\n",
"\n",
"For more detailed information on Cypher queries, please refer to the [official documentation](https://neo4j.com/developer/cypher/)."
]
}
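,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The MATCH / WHERE / RETURN structure lends itself to a small helper that assembles the query string from its parts. The sketch below is one hypothetical way to do that (`build_cypher_query` is not part of EpiGraphDB or Neo4j). It only builds the query string and the JSON body, reusing the clauses from the custom query in this section, and does not contact the API.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"\n",
"def build_cypher_query(match, conditions, returns):\n",
"    # Join the conditions with AND and the return items with commas\n",
"    where_clause = ' AND '.join(conditions)\n",
"    return_clause = ', '.join(returns)\n",
"    return f'MATCH {match} WHERE {where_clause} RETURN {return_clause}'\n",
"\n",
"\n",
"cypher_query = build_cypher_query(\n",
"    match='(trait1:Gwas)-[corr:BN_GEN_COR]-(trait2:Gwas)',\n",
"    conditions=['trait1.trait = \"Body mass index (BMI)\"', 'corr.rg > 0.9'],\n",
"    returns=['trait1', 'trait2', 'corr {.rg, .p}'],\n",
")\n",
"print(cypher_query)\n",
"\n",
"# JSON body in the same form as the POST request to /cypher\n",
"post_body = json.dumps({'query': cypher_query})\n",
"```\n",
"\n",
"Keeping the conditions in a list makes it easy to add or remove filters without worrying about the spacing between clauses."
]
}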
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}