{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started with EpiGraphDB in Python\n", "\n", "This notebook is provided as a brief introductory guide to working with the EpiGraphDB platform through Python. Here we will demonstrate a few basic operations that can be carried out using the platform, but for more advanced methods please refer to the [API endpoint documentation](http://docs.epigraphdb.org/api/api-endpoints/).\n", "\n", "A Python wrapper for EpiGraphDB's API is currently in the works, but for now we will be querying it directly using the `requests` library. Knowledge of this package is advantageous but not essential." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we will ping the API to check our connection:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "If this line gets printed, ping was successful.\n" ] } ], "source": [ "# Store our API URL as a string for future use\n", "API_URL = \"https://api.epigraphdb.org\"\n", "\n", "# Here we use the .get() method to send a GET request to the /ping endpoint of the API\n", "endpoint = '/ping'\n", "response_object = requests.get(API_URL + endpoint)\n", "\n", "# Check that the ping was successful (raises an exception on an HTTP error status)\n", "response_object.raise_for_status()\n", "print(\"If this line gets printed, ping was successful.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "## 1. Using EpiGraphDB to obtain biological mappings\n", "\n", "In this first section, we will take an arbitrary list of genes and query the EpiGraphDB API to find the proteins that they map to. We will be using the `POST` HTTP method, which requires its parameters to be passed in JSON format, a conversion that is easy to do using the `json` library. 
To find the correct names of the parameters that we are about to set, we can navigate to the [EpiGraphDB API documentation](http://docs.epigraphdb.org/api/api-endpoints/) and find the endpoint of interest. From there we simply read off the parameters that we want to pass, and can take a look at the example request as a reference point if needed." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
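<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>gene.name</th><th>gene.ensembl_id</th><th>protein.uniprot_id</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>TP53</td><td>ENSG00000141510</td><td>P04637</td></tr>\n",
"<tr><th>1</th><td>BRCA1</td><td>ENSG00000012048</td><td>P38398</td></tr>\n",
"<tr><th>2</th><td>TNF</td><td>ENSG00000232810</td><td>P01375</td></tr>\n",
"</tbody>\n",
"</table>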
\n", "
" ], "text/plain": [ " gene.name gene.ensembl_id protein.uniprot_id\n", "0 TP53 ENSG00000141510 P04637\n", "1 BRCA1 ENSG00000012048 P38398\n", "2 TNF ENSG00000232810 P01375" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 1.1 Mapping genes to proteins\n", "\n", "# Set parameters and convert to JSON format\n", "import json\n", "params = {\n", " \"gene_name_list\": [\n", " \"TP53\",\n", " \"BRCA1\", \n", " \"TNF\"\n", " ]\n", "}\n", "json_params = json.dumps(params)\n", "\n", "# Define which endpoint of the API we would like to connect with\n", "endpoint = '/mappings/gene-to-protein'\n", "\n", "# Send the POST request\n", "response_object = requests.post(API_URL + endpoint, data=json_params)\n", "\n", "# Check for successful request\n", "response_object.raise_for_status()\n", "\n", "# Store results in a pandas dataframe\n", "import pandas as pd\n", "results = response_object.json()['results']\n", "gene_protein_df = pd.json_normalize(results)\n", "\n", "gene_protein_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above cell, we queried EpiGraphDB for the proteins that have been mapped to the genes *TP53*, *BRCA1*, and *TNF*. Our query went through successfully and we received an associated protein for each. The columns in our output dataframe take the general form `entity.property` and this will remain consistent throughout this notebook. \n", "\n", "Specific descriptions for the properties of each entity can be found in EpiGraphDB's [data dictionary](https://docs.epigraphdb.org/graph-database/meta-nodes/). Simply click on the relevant entity in the table of contents on the right hand side (or scroll down to the relevant section), then locate the property of interest." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>uniprot_id</th><th>pathway_count</th><th>pathway_reactome_id</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>P04637</td><td>5</td><td>[R-HSA-6785807, R-HSA-390471, R-HSA-5689896, R...</td></tr>\n",
"<tr><th>1</th><td>P38398</td><td>6</td><td>[R-HSA-6796648, R-HSA-1221632, R-HSA-8953750, ...</td></tr>\n",
"<tr><th>2</th><td>P01375</td><td>3</td><td>[R-HSA-6785807, R-HSA-6783783, R-HSA-5357905]</td></tr>\n",
"</tbody>\n",
"</table>" ], "text/plain": [ " uniprot_id pathway_count pathway_reactome_id\n", "0 P04637 5 [R-HSA-6785807, R-HSA-390471, R-HSA-5689896, R...\n", "1 P38398 6 [R-HSA-6796648, R-HSA-1221632, R-HSA-8953750, ...\n", "2 P01375 3 [R-HSA-6785807, R-HSA-6783783, R-HSA-5357905]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 1.2 Proteins to pathways\n", "\n", "# As above, this is another POST request, so we need our data in JSON format\n", "json_params = json.dumps({\n", "    \"uniprot_id_list\": gene_protein_df['protein.uniprot_id'].tolist()\n", "})\n", "\n", "# Send the request\n", "endpoint = '/protein/in-pathway'\n", "response_object = requests.post(API_URL + endpoint, data=json_params)\n", "\n", "# Check for successful request\n", "response_object.raise_for_status()\n", "\n", "# Store results\n", "results = response_object.json()['results']\n", "protein_pathway_df = pd.json_normalize(results)\n", "\n", "protein_pathway_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, we took the proteins that had been mapped to our genes of interest and queried the platform for their associated pathway data. The API found multiple such pathways for each protein and has returned the respective Reactome IDs to us as lists.\n", "\n", "\n", "It is worth noting here that so far we have only been accessing the `'results'` key in the nested dictionary returned by the `.json()` method of our response object. The other available key is `'metadata'` (see the output below), which provides us with information about the request itself, including the specific Cypher query that the platform ran to get these results. If you would like to know more about the use of Cypher in these requests, there is a section dedicated to this at the end of this notebook." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'empty_results': False,\n", " 'query': 'MATCH p=(protein:Protein)-[r:PROTEIN_IN_PATHWAY]-(pathway:Pathway) '\n", " \"WHERE protein.uniprot_id IN ['P04637', 'P38398', 'P01375'] RETURN \"\n", " 'protein.uniprot_id AS uniprot_id, count(p) AS pathway_count, '\n", " 'collect(pathway.reactome_id) AS pathway_reactome_id',\n", " 'total_seconds': 0.005797}\n" ] } ], "source": [ "from pprint import pprint\n", "metadata = response_object.json()['metadata']\n", "\n", "pprint(metadata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "## 2. Epidemiological relationship analysis\n", "\n", "In the cell below, we will query EpiGraphDB to get metadata relating to GWAS studies of a target trait: body mass index. We will then query for pre-computed Mendelian Randomisation (MR) results involving the same trait.\n", "\n", "Here we will be using a different HTTP method than before: the `GET` method, which is in fact easier to use in Python because the parameters can be passed directly as a dictionary. To learn more about the differences between `GET` and `POST`, please see [this guide](https://www.w3schools.com/tags/ref_httpmethods.asp)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
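<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>node.note</th><th>node.access</th><th>node.year</th><th>node.mr</th><th>node.author</th><th>node.consortium</th><th>node.sex</th><th>node.priority</th><th>node.pmid</th><th>node.population</th><th>node.unit</th><th>node.sample_size</th><th>node.nsnp</th><th>node.trait</th><th>node.id</th><th>node.subcategory</th><th>node.category</th><th>node.sd</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>NA</td><td>public</td><td>2018</td><td>1</td><td>Hoffmann TJ</td><td>NA</td><td>NA</td><td>0</td><td>30108127</td><td>European</td><td>NA</td><td>315347</td><td>27854527</td><td>Body mass index</td><td>ebi-a-GCST006368</td><td>NA</td><td>NA</td><td>NaN</td></tr>\n",
"<tr><th>1</th><td>NaN</td><td>public</td><td>2015</td><td>1</td><td>Locke AE</td><td>NA</td><td>Males and Females</td><td>1</td><td>25673413</td><td>Mixed</td><td>NA</td><td>339224</td><td>2555511</td><td>Body mass index</td><td>ieu-a-2</td><td>Anthropometric</td><td>Risk factor</td><td>4.77</td></tr>\n",
"<tr><th>2</th><td>NaN</td><td>public</td><td>2015</td><td>1</td><td>Locke AE</td><td>NA</td><td>Males</td><td>2</td><td>25673413</td><td>European</td><td>NA</td><td>152893</td><td>2477659</td><td>Body mass index</td><td>ieu-a-785</td><td>Anthropometric</td><td>Risk factor</td><td>4.77</td></tr>\n",
"<tr><th>3</th><td>NaN</td><td>public</td><td>2015</td><td>1</td><td>Locke AE</td><td>NA</td><td>Males and Females</td><td>3</td><td>25673413</td><td>European</td><td>NA</td><td>322154</td><td>2554668</td><td>Body mass index</td><td>ieu-a-835</td><td>Anthropometric</td><td>Risk factor</td><td>4.77</td></tr>\n",
"<tr><th>4</th><td>NA</td><td>public</td><td>2017</td><td>1</td><td>Akiyama M</td><td>NA</td><td>NA</td><td>0</td><td>28892062</td><td>East Asian</td><td>NA</td><td>158284</td><td>5952516</td><td>Body mass index</td><td>ebi-a-GCST004904</td><td>NA</td><td>NA</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>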
\n", "
" ], "text/plain": [ " node.note node.access node.year node.mr node.author node.consortium \\\n", "0 NA public 2018 1 Hoffmann TJ NA \n", "1 NaN public 2015 1 Locke AE NA \n", "2 NaN public 2015 1 Locke AE NA \n", "3 NaN public 2015 1 Locke AE NA \n", "4 NA public 2017 1 Akiyama M NA \n", "\n", " node.sex node.priority node.pmid node.population node.unit \\\n", "0 NA 0 30108127 European NA \n", "1 Males and Females 1 25673413 Mixed NA \n", "2 Males 2 25673413 European NA \n", "3 Males and Females 3 25673413 European NA \n", "4 NA 0 28892062 East Asian NA \n", "\n", " node.sample_size node.nsnp node.trait node.id \\\n", "0 315347 27854527 Body mass index ebi-a-GCST006368 \n", "1 339224 2555511 Body mass index ieu-a-2 \n", "2 152893 2477659 Body mass index ieu-a-785 \n", "3 322154 2554668 Body mass index ieu-a-835 \n", "4 158284 5952516 Body mass index ebi-a-GCST004904 \n", "\n", " node.subcategory node.category node.sd \n", "0 NA NA NaN \n", "1 Anthropometric Risk factor 4.77 \n", "2 Anthropometric Risk factor 4.77 \n", "3 Anthropometric Risk factor 4.77 \n", "4 NA NA NaN " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 2.1 Getting GWAS studies from EpiGraphDB\n", "\n", "# Create a dictionary for the parameters to be passed\n", "params = {\n", " 'name':'Body mass index'\n", "}\n", "\n", "# Send the request\n", "endpoint = '/meta/nodes/Gwas/search'\n", "response_object = requests.get(API_URL + endpoint, params=params)\n", "response_object.raise_for_status()\n", "\n", "# Store the results of the query and display\n", "result = response_object.json()['results']\n", "gwas_df = pd.json_normalize(result)\n", "\n", "gwas_df.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
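<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>exposure.id</th><th>exposure.trait</th><th>outcome.id</th><th>outcome.trait</th><th>mr.b</th><th>mr.se</th><th>mr.pval</th><th>mr.method</th><th>mr.selection</th><th>mr.moescore</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>ieu-a-2</td><td>Body mass index</td><td>ukb-a-74</td><td>Non-cancer illness code self-reported: diabetes</td><td>0.034559</td><td>0.002418</td><td>0.0</td><td>FE IVW</td><td>DF</td><td>0.93</td></tr>\n",
"<tr><th>1</th><td>ieu-a-2</td><td>Body mass index</td><td>ukb-a-388</td><td>Hip circumference</td><td>0.724105</td><td>0.026588</td><td>0.0</td><td>Simple median</td><td>Tophits</td><td>0.95</td></tr>\n",
"<tr><th>2</th><td>ieu-a-2</td><td>Body mass index</td><td>ukb-a-382</td><td>Waist circumference</td><td>0.656440</td><td>0.024496</td><td>0.0</td><td>Simple median</td><td>Tophits</td><td>0.94</td></tr>\n",
"<tr><th>3</th><td>ieu-a-2</td><td>Body mass index</td><td>ukb-a-35</td><td>Comparative height size at age 10</td><td>0.136684</td><td>0.007909</td><td>0.0</td><td>FE IVW</td><td>Tophits</td><td>0.94</td></tr>\n",
"<tr><th>4</th><td>ieu-a-2</td><td>Body mass index</td><td>ukb-a-34</td><td>Comparative body size at age 10</td><td>0.365580</td><td>0.023556</td><td>0.0</td><td>Simple median</td><td>HF</td><td>0.87</td></tr>\n",
"</tbody>\n",
"</table>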
\n", "
" ], "text/plain": [ " exposure.id exposure.trait outcome.id \\\n", "0 ieu-a-2 Body mass index ukb-a-74 \n", "1 ieu-a-2 Body mass index ukb-a-388 \n", "2 ieu-a-2 Body mass index ukb-a-382 \n", "3 ieu-a-2 Body mass index ukb-a-35 \n", "4 ieu-a-2 Body mass index ukb-a-34 \n", "\n", " outcome.trait mr.b mr.se \\\n", "0 Non-cancer illness code self-reported: diabetes 0.034559 0.002418 \n", "1 Hip circumference 0.724105 0.026588 \n", "2 Waist circumference 0.656440 0.024496 \n", "3 Comparative height size at age 10 0.136684 0.007909 \n", "4 Comparative body size at age 10 0.365580 0.023556 \n", "\n", " mr.pval mr.method mr.selection mr.moescore \n", "0 0.0 FE IVW DF 0.93 \n", "1 0.0 Simple median Tophits 0.95 \n", "2 0.0 Simple median Tophits 0.94 \n", "3 0.0 FE IVW Tophits 0.94 \n", "4 0.0 Simple median HF 0.87 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 2.2 Getting MR results for a trait\n", "\n", "# Set parameters\n", "params = {'exposure_trait': 'Body mass index',\n", " 'pval_threshold': 1e-10}\n", "\n", "# Send request\n", "endpoint = '/mr'\n", "response_object = requests.get(API_URL + endpoint, params=params)\n", "response_object.raise_for_status()\n", "\n", "# Store and display results\n", "result = response_object.json()['results']\n", "BMI_MR_df = pd.json_normalize(result) \n", "\n", "BMI_MR_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataframe above displays the results of our query. We requested all traits for which an MR analysis using body mass index as the exposure variable returned a causal estimate with a p-value lower than 1e-10. Information regarding the specific MR parameters, as well as the exposure and outcome variables, has been displayed in the table for all traits that matched our search conditions.\n", "\n", "In the parameters we set in 2.2, another viable parameter name is `'outcome_trait'` which takes the same type of values as `'exposure_trait'`. 
Either one or both of these parameters can be passed during an MR query, which allows users to refine which results are returned to them depending on their own analytical preferences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "## 3. Looking for literature evidence\n", "\n", "Accessing information in the literature is a ubiquitous task in research, be it for novel hypothesis generation or as part of evidence triangulation. EpiGraphDB facilitates fast processing of this information by allowing access to a host of literature-mined relationships that have been structured into semantic triples. These take the general form *(subject, predicate, object)* and have been generated using contemporary natural language processing techniques applied to a massive amount of published biomedical research papers. In the following section we will query the API for the literature relationship between a given gene and an outcome trait." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>pubmed_id</th><th>gene.name</th><th>st.predicate</th><th>st.object_name</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>[17484863, 21155887]</td><td>IL23R</td><td>NEG_ASSOCIATED_WITH</td><td>Inflammatory Bowel Diseases</td></tr>\n",
"<tr><th>1</th><td>[27852544]</td><td>IL23R</td><td>AFFECTS</td><td>Inflammatory Bowel Diseases</td></tr>\n",
"<tr><th>2</th><td>[17484863, 19575361, 19496308, 18383521, 18341...</td><td>IL23R</td><td>ASSOCIATED_WITH</td><td>Inflammatory Bowel Diseases</td></tr>\n",
"<tr><th>3</th><td>[23131344]</td><td>IL23R</td><td>PREDISPOSES</td><td>Inflammatory Bowel Diseases</td></tr>\n",
"</tbody>\n",
"</table>" ], "text/plain": [ " pubmed_id gene.name \\\n", "0 [17484863, 21155887] IL23R \n", "1 [27852544] IL23R \n", "2 [17484863, 19575361, 19496308, 18383521, 18341... IL23R \n", "3 [23131344] IL23R \n", "\n", " st.predicate st.object_name \n", "0 NEG_ASSOCIATED_WITH Inflammatory Bowel Diseases \n", "1 AFFECTS Inflammatory Bowel Diseases \n", "2 ASSOCIATED_WITH Inflammatory Bowel Diseases \n", "3 PREDISPOSES Inflammatory Bowel Diseases " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Establish parameters\n", "params = {\n", "    'gene_name': \"IL23R\",\n", "    'object_name': \"Inflammatory bowel disease\"\n", "}\n", "\n", "# Send the request\n", "endpoint = \"/literature/gene\"\n", "response_object = requests.get(API_URL + endpoint, params=params)\n", "response_object.raise_for_status()\n", "\n", "# Store the results of the query and display\n", "result = response_object.json()['results']\n", "lit_df = pd.json_normalize(result)\n", "\n", "lit_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataframe above shows the results of our query: four unique predicates were found between the gene *IL23R* and the trait *Inflammatory bowel disease*, and they are displayed in the `st.predicate` column. The leftmost column contains the PubMed IDs of the papers from which each triple was derived. These IDs allow us to access the respective papers by navigating to `https://pubmed.ncbi.nlm.nih.gov/*insert_pubmed_id_here*`. In this particular case it seems that *ASSOCIATED_WITH* is the most common predicate linking our gene to the trait, but we can't see exactly how many papers there are due to how pandas displays lists. Let's add a paper count to the dataframe." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>pubmed_id</th><th>gene.name</th><th>st.predicate</th><th>st.object_name</th><th>publication_count</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>[17484863, 21155887]</td><td>IL23R</td><td>NEG_ASSOCIATED_WITH</td><td>Inflammatory Bowel Diseases</td><td>2</td></tr>\n",
"<tr><th>1</th><td>[27852544]</td><td>IL23R</td><td>AFFECTS</td><td>Inflammatory Bowel Diseases</td><td>1</td></tr>\n",
"<tr><th>2</th><td>[17484863, 19575361, 19496308, 18383521, 18341...</td><td>IL23R</td><td>ASSOCIATED_WITH</td><td>Inflammatory Bowel Diseases</td><td>21</td></tr>\n",
"<tr><th>3</th><td>[23131344]</td><td>IL23R</td><td>PREDISPOSES</td><td>Inflammatory Bowel Diseases</td><td>1</td></tr>\n",
"</tbody>\n",
"</table>" ], "text/plain": [ " pubmed_id gene.name \\\n", "0 [17484863, 21155887] IL23R \n", "1 [27852544] IL23R \n", "2 [17484863, 19575361, 19496308, 18383521, 18341... IL23R \n", "3 [23131344] IL23R \n", "\n", " st.predicate st.object_name publication_count \n", "0 NEG_ASSOCIATED_WITH Inflammatory Bowel Diseases 2 \n", "1 AFFECTS Inflammatory Bowel Diseases 1 \n", "2 ASSOCIATED_WITH Inflammatory Bowel Diseases 21 \n", "3 PREDISPOSES Inflammatory Bowel Diseases 1 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count the PubMed IDs listed in each row and store the count as a new column\n", "counts = [len(papers_list) for papers_list in lit_df['pubmed_id']]\n", "lit_df['publication_count'] = counts\n", "\n", "lit_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "## 4. EpiGraphDB node search\n", "\n", "EpiGraphDB stores data as nodes (entities) and edges (relationships) of a wide range of types. The `/meta` endpoints of the API offer us information about the structure of the graph itself. For example, the available classes of nodes can be listed through the `/meta/nodes/list` endpoint. 
Let’s do that now:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Gwas',\n", " 'Disease',\n", " 'Drug',\n", " 'Efo',\n", " 'Event',\n", " 'Gene',\n", " 'Tissue',\n", " 'Literature',\n", " 'Pathway',\n", " 'Protein',\n", " 'SemmedTerm',\n", " 'Variant']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 4.1 Getting a list of available meta-nodes\n", "\n", "# Send the request\n", "endpoint = \"/meta/nodes/list\"\n", "response_object = requests.get(API_URL + endpoint)\n", "response_object.raise_for_status()\n", "\n", "# Store the results of the query and display\n", "result = response_object.json()\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The list above corresponds to EpiGraphDB's meta nodes, whose documentation can be found [here](https://docs.epigraphdb.org/graph-database/meta-nodes/) along with their available properties.\n", "\n", "Next, we will demonstrate how to search by name for a node of interest using the endpoint `/meta/nodes/{meta_node}/search`, where the valid values for `{meta_node}` are the terms listed above.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
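<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>node.trait</th><th>node.id</th><th>node.sample_size</th><th>node.year</th><th>node.author</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Breast cancer</td><td>ebi-a-GCST007236</td><td>89677</td><td>2015</td><td>Michailidou K</td></tr>\n",
"<tr><th>1</th><td>Breast cancer</td><td>ebi-a-GCST004988</td><td>139274</td><td>2017</td><td>Michailidou K</td></tr>\n",
"<tr><th>2</th><td>Breast cancer (Combined Oncoarray; iCOGS; GWAS...</td><td>ieu-a-1126</td><td>228951</td><td>2017</td><td>Michailidou K</td></tr>\n",
"<tr><th>3</th><td>Breast cancer (GWAS)</td><td>ieu-a-1131</td><td>32498</td><td>2017</td><td>Michailidou K</td></tr>\n",
"<tr><th>4</th><td>Breast cancer (GWAS)</td><td>ieu-a-1168</td><td>33832</td><td>2015</td><td>Michailidou K</td></tr>\n",
"<tr><th>5</th><td>Breast cancer (Oncoarray)</td><td>ieu-a-1129</td><td>106776</td><td>2017</td><td>Michailidou K</td></tr>\n",
"<tr><th>6</th><td>Breast cancer (Survival)</td><td>ieu-a-1165</td><td>37954</td><td>2015</td><td>Guo Q</td></tr>\n",
"<tr><th>7</th><td>Breast cancer (iCOGS)</td><td>ieu-a-1162</td><td>89677</td><td>2015</td><td>Michailidou K</td></tr>\n",
"<tr><th>8</th><td>Breast cancer (iCOGS)</td><td>ieu-a-1130</td><td>89677</td><td>2017</td><td>Michailidou K</td></tr>\n",
"<tr><th>9</th><td>Breast cancer anti-estrogen resistance protein 3</td><td>prot-a-234</td><td>3301</td><td>2018</td><td>Sun BB</td></tr>\n",
"</tbody>\n",
"</table>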
\n", "
" ], "text/plain": [ " node.trait node.id \\\n", "0 Breast cancer ebi-a-GCST007236 \n", "1 Breast cancer ebi-a-GCST004988 \n", "2 Breast cancer (Combined Oncoarray; iCOGS; GWAS... ieu-a-1126 \n", "3 Breast cancer (GWAS) ieu-a-1131 \n", "4 Breast cancer (GWAS) ieu-a-1168 \n", "5 Breast cancer (Oncoarray) ieu-a-1129 \n", "6 Breast cancer (Survival) ieu-a-1165 \n", "7 Breast cancer (iCOGS) ieu-a-1162 \n", "8 Breast cancer (iCOGS) ieu-a-1130 \n", "9 Breast cancer anti-estrogen resistance protein 3 prot-a-234 \n", "\n", " node.sample_size node.year node.author \n", "0 89677 2015 Michailidou K \n", "1 139274 2017 Michailidou K \n", "2 228951 2017 Michailidou K \n", "3 32498 2017 Michailidou K \n", "4 33832 2015 Michailidou K \n", "5 106776 2017 Michailidou K \n", "6 37954 2015 Guo Q \n", "7 89677 2015 Michailidou K \n", "8 89677 2017 Michailidou K \n", "9 3301 2018 Sun BB " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 4.2 Searching for specific entities by name\n", "\n", "# Set params \n", "params = {\n", " 'name': 'breast cancer'\n", "}\n", "\n", "# Make request\n", "meta_node = 'Gwas'\n", "endpoint = f\"/meta/nodes/{meta_node}/search\"\n", "response_object = requests.get(API_URL + endpoint, params=params)\n", "response_object.raise_for_status()\n", "\n", "# Convert to pandas\n", "results = pd.json_normalize(response_object.json()['results'])\n", "target_node_id = results['node.id'][3] # Store one ID for use in the next cell\n", "\n", "results[['node.trait', 'node.id', 'node.sample_size', 'node.year', 'node.author']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above we used the `name` parameter of the endpoint to search for any GWAS nodes that fuzzily matched our specified string. Several did, and some of their basic node properties are displayed above. Fuzzy matching is useful because you don't need to know the exact name of the entity or its ID in order to look it up. 
\n", "\n", "On the other hand, once you have identified your entity of interest, it is often sensible to move forward using the node's ID for the sake of unambiguity. Fortunately we can also search for traits using their ID, as demonstrated below." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>node.ncase</th><th>node.access</th><th>node.year</th><th>node.mr</th><th>node.author</th><th>node.consortium</th><th>node.sex</th><th>node.priority</th><th>node.pmid</th><th>node.population</th><th>node.unit</th><th>node.sample_size</th><th>node.nsnp</th><th>node.ncontrol</th><th>node.trait</th><th>node.id</th><th>node.subcategory</th><th>node.category</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>14910</td><td>public</td><td>2017</td><td>1</td><td>Michailidou K</td><td>NA</td><td>Females</td><td>1</td><td>29059683</td><td>European</td><td>NA</td><td>32498</td><td>10680257</td><td>17588</td><td>Breast cancer (GWAS)</td><td>ieu-a-1131</td><td>Cancer</td><td>Disease</td></tr>\n",
"</tbody>\n",
"</table>" ], "text/plain": [ " node.ncase node.access node.year node.mr node.author node.consortium \\\n", "0 14910 public 2017 1 Michailidou K NA \n", "\n", " node.sex node.priority node.pmid node.population node.unit node.sample_size \\\n", "0 Females 1 29059683 European NA 32498 \n", "\n", " node.nsnp node.ncontrol node.trait node.id node.subcategory \\\n", "0 10680257 17588 Breast cancer (GWAS) ieu-a-1131 Cancer \n", "\n", " node.category \n", "0 Disease " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 4.3 Searching for a node by ID\n", "\n", "# Set params\n", "params = {\n", "    'id': target_node_id  # From previous cell\n", "}\n", "\n", "# Make request\n", "meta_node = 'Gwas'\n", "endpoint = f\"/meta/nodes/{meta_node}/search\"\n", "response_object = requests.get(API_URL + endpoint, params=params)\n", "response_object.raise_for_status()\n", "\n", "# Convert to pandas\n", "results = pd.json_normalize(response_object.json()['results'])\n", "\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "## Advanced examples: Cypher\n", "\n", "Until now, to get information from the platform we have simply been creating a dictionary or JSON object containing our parameters and then sending it to the correct endpoint of the API using the `requests` library. This is perfectly good practice, and the API has been designed specifically to allow this method of use, as we have (non-exhaustively) demonstrated above. It works because the API automatically converts the HTTP requests that it receives into a Cypher query, which it then passes to the Neo4j database on which EpiGraphDB is built. The database passes back the result of the query, which is then returned to us in Python as a response object. Each response object contains metadata that includes the exact Cypher query that was called on the database, as shown in the cell below." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MATCH (gene:Gene)-[gp:GENE_TO_PROTEIN]-(protein:Protein) WHERE gene.name IN ['TP53'] RETURN gene {.ensembl_id, .name}, protein {.uniprot_id}\n" ] } ], "source": [ "# Inspecting the Cypher query behind a request\n", "\n", "params = {\n", "    \"gene_name_list\": [\n", "        \"TP53\"\n", "    ]\n", "}\n", "json_params = json.dumps(params)\n", "endpoint = '/mappings/gene-to-protein'\n", "response_object = requests.post(API_URL + endpoint, data=json_params)\n", "response_object.raise_for_status()\n", "\n", "# Extract and print the Cypher query\n", "cypher_query = response_object.json()['metadata']['query']\n", "print(cypher_query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The text printed above is the exact Cypher query that the API ran behind the scenes for this request. It is the same query as in section 1.1, restricted here to a single gene. The basic structure of these queries is as follows:\n", "\n", "    MATCH subgraph\n", "\n", "    WHERE condition\n", "\n", "    RETURN data\n", "\n", "Note that the subgraph should take this general form: *(node)-[relationship]-(node)*. Both nodes and relationships are written as `my_variable_name:Meta_node`, so that we can access their properties through the variable name we assigned (`my_variable_name`) and use those properties to define our conditions and the data we want returned. 
Information on the available properties for each class of entity can be found in EpiGraphDB's documentation, specifically [here for nodes](https://docs.epigraphdb.org/graph-database/meta-nodes/) and [here for relationships](https://docs.epigraphdb.org/graph-database/meta-relationships/).\n", "\n", "Now let's write and send our own basic query to get traits with high genetic correlation to body mass index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Writing custom Cypher queries\n", "\n", "# Define the target subgraph\n", "cypher_query = 'MATCH (trait1:Gwas)-[corr:BN_GEN_COR]-(trait2:Gwas)'\n", "\n", "# Add conditions to the query (the Cypher string literal uses double quotes,\n", "# so the Python string is delimited with single quotes)\n", "cypher_query += ' WHERE trait1.trait = \"Body mass index (BMI)\"'\n", "cypher_query += ' AND corr.rg > 0.9'\n", "\n", "# Add which data we want returned\n", "cypher_query += ' RETURN trait1, trait2, corr {.rg, .p}'\n", "\n", "# Put our query into the correct format for a POST request\n", "params = json.dumps({\n", "    'query': cypher_query\n", "})\n", "\n", "# Define the target endpoint and send the request\n", "endpoint = '/cypher'\n", "response_object = requests.post(API_URL + endpoint, data=params)\n", "response_object.raise_for_status()\n", "\n", "# Display the returned data\n", "results = response_object.json()['results']\n", "results_df = pd.json_normalize(results)\n", "\n", "results_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our Cypher query, we grabbed a subgraph from the database that comprised nodes representing biomedical traits, with 
edges between them representing their genetic correlation. The subgraph was then filtered to select any node-edge-node triples where the first node had a `.trait` property equal to \"Body mass index (BMI)\", and where the edge between the nodes had a `.rg` (genetic correlation score) value greater than 0.9. We then asked Neo4j to return the two trait nodes, as well as the score and p-value of the correlation between them, for every triple that satisfied our conditions. Finally, we converted the returned records to a dataframe for ease of viewing.\n", "\n", "For more detailed information on Cypher queries, please refer to the [official documentation](https://neo4j.com/developer/cypher/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }