{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Abstract\n",
"\n",
"**Author:** [Charles Tapley Hoyt](https://github.com/cthoyt)\n",
"\n",
"**Estimated Run Time:** 4 minutes\n",
"\n",
"This notebook calculates the concordance between differential gene expression and the causal relationships between their protein products. This analysis assumes the forward hypothesis, under which gene expression is believed to be correlated with protein activity. While there are many faults to this assumption, it directly enables a very simple analysis of a knowledge assembly. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Notebook Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import logging\n",
"import os\n",
"import time\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from pandas.tools.plotting import scatter_matrix\n",
"from matplotlib_venn import venn2\n",
"import seaborn as sns\n",
"\n",
"import pybel\n",
"import pybel_tools as pbt\n",
"from pybel_tools.analysis.concordance import *\n",
"from pybel.constants import *\n",
"from pybel.canonicalize import calculate_canonical_name\n",
"from pybel_tools.visualization import to_jupyter"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%config InlineBackend.figure_format = 'svg'\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Notebook Provenance\n",
"\n",
"The time of execution, random number generator seed, and the versions of the software packages used are displayed explicitly."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Sun Aug 27 11:39:19 2017'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"time.asctime()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# seed the random number generator\n",
"import random\n",
"random.seed(127)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0.7.3-dev'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pybel.__version__"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0.2.2-dev'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pbt.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Local Path Definitions\n",
"\n",
"To make this notebook interoperable across many machines, locations to the repositories that contain the data used in this notebook are referenced from the environment, set in `~/.bashrc` to point to the place where the repositories have been cloned. Assuming the repositories have been `git clone`'d into the `~/dev` folder, the entries in `~/.bashrc` should look like:\n",
"\n",
"```bash\n",
"...\n",
"export BMS_BASE=~/dev/bms\n",
"...\n",
"```\n",
"\n",
"#### BMS \n",
"\n",
"The biological model store (BMS) is the internal Fraunhofer SCAI repository for keeping BEL models under version control. It can be downloaded from https://tor-2.scai.fraunhofer.de/gf/project/bms/"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"bms_base = os.environ['BMS_BASE']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### OwnCloud\n",
"\n",
"The differential gene expression data used in this notebook is currently not published, and is obfuscated with reference through our team's internal data storage system with [OwnCloud](https://owncloud.org/)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"owncloud_base = os.environ['OWNCLOUD_BASE']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data\n",
"\n",
"## Alzheimer's Disease Knowledge Assembly\n",
"\n",
"The Alzheimer's Disease Knowledge Assembly has been precompiled with the following command line script, and will be loaded from this format for improved performance. In general, derived data, such as the gpickle representation of a BEL script, are not saved under version control to ensure that the most up-to-date data is always used.\n",
"\n",
"```sh\n",
"pybel convert --path \"$BMS_BASE/aetionomy/alzheimers.bel\" --pickle \"$BMS_BASE/aetionomy/alzheimers.gpickle\"\n",
"```\n",
"\n",
"The BEL script can also be compiled from inside this notebook with the following python code:\n",
"\n",
"```python\n",
">>> import os\n",
">>> import pybel\n",
">>> # Input from BEL script\n",
">>> bel_path = os.path.join(bms_base, 'aetionomy', 'alzheimers.bel')\n",
">>> graph = pybel.from_path(bel_path)\n",
">>> # Output to gpickle for fast loading later\n",
>> pickle_path = os.path.join">
">>> pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')\n",
">>> pybel.to_pickle(graph, pickle_path)\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"graph = pybel.from_pickle(pickle_path)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'4.0.3'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"graph.version"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All orthologies are discarded before analysis."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pbt.filters.remove_nodes_by_namespace(graph, {'MGI', 'RGD'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Following the forward hypothesis, the knowledge graph is collapsed to genes using [pbt.mutation.collapse_by_central_dogma_to_genes](http://pybel-tools.readthedocs.io/en/latest/mutation.html#pybel_tools.mutation.collapse_by_central_dogma_to_genes)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pbt.mutation.collapse_by_central_dogma_to_genes(graph)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An additional assumption is made about the activities of variants of proteins and genes. All variants are collapsed to the reference gene."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pbt.mutation.rewire_variants_to_genes(graph)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nodes: 3701\n",
"Edges: 19562\n",
"Citations: 1760\n",
"Authors: 9305\n",
"Network density: 0.0014285401315933604\n",
"Components: 70\n",
"Average degree: 5.2855984868954335\n",
"Compilation warnings: 2578\n"
]
}
],
"source": [
"pbt.summary.print_summary(graph)"
]
},
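{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the summary printed above, the reported density and average degree can be reproduced from the node and edge counts. This is a worked example under two assumptions: that the density is the one for a directed graph, and that the average degree is defined as edges per node. Both definitions happen to match the printed values, but they are not a statement about how `pbt.summary.print_summary` is implemented.\n",
"\n",
"$$\\text{density} = \\frac{|E|}{|V| (|V| - 1)} = \\frac{19562}{3701 \\times 3700} \\approx 0.0014285$$\n",
"\n",
"$$\\text{average degree} = \\frac{|E|}{|V|} = \\frac{19562}{3701} \\approx 5.2856$$"
]
},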
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pbt.summary.plot_summary(graph, plt, figsize=(10, 4))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Differential Gene Expression\n",
"\n",
"Differential gene expression data can be obtained from many sources, including ADNI and other large clinical studies. This analysis is concerned with the log-fold-changes on each gene, and not necessarily the p-value. This is better as a data-driven process becuase it does not require a model or multiple hypothesis testing on raw data."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data_path = os.path.join(owncloud_base, 'alzheimers', 'SevAD.csv')\n",
"target_columns = ['Gene.symbol', 'logFC']"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Gene.symbol | \n",
" logFC | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" ZNF616 | \n",
" -4.244691 | \n",
"
\n",
" \n",
" | 1 | \n",
" DEFB125 | \n",
" 3.974393 | \n",
"
\n",
" \n",
" | 3 | \n",
" SNAP23 | \n",
" 3.337636 | \n",
"
\n",
" \n",
" | 4 | \n",
" PHLDB2 | \n",
" 3.192559 | \n",
"
\n",
" \n",
" | 5 | \n",
" LOC389895 | \n",
" -4.296850 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/latex": [
"\\begin{center}{\\begin{tabular}{llr}\n",
"\\toprule\n",
"{} & Gene.symbol & logFC \\\\\n",
"\\midrule\n",
"0 & ZNF616 & -4.244691 \\\\\n",
"1 & DEFB125 & 3.974393 \\\\\n",
"3 & SNAP23 & 3.337636 \\\\\n",
"4 & PHLDB2 & 3.192559 \\\\\n",
"5 & LOC389895 & -4.296850 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"}\\end{center}"
],
"text/plain": [
" Gene.symbol logFC\n",
"0 ZNF616 -4.244691\n",
"1 DEFB125 3.974393\n",
"3 SNAP23 3.337636\n",
"4 PHLDB2 3.192559\n",
"5 LOC389895 -4.296850"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(data_path)\n",
"df = df.loc[df['Gene.symbol'].notnull(), target_columns]\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A histogram of the log-fold-changes shows that the data are normally distributed, as expecOn the left, the number of shared elements in the knowledge assembly and differential gene data set are counted.\n",
"\n",
"On the right, a histogram of the log-fold-changes shows that the data are normally distributed, as expected for differential gene expression data.ted for differential gene expression data."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data = {k: v for _, k, v in df.itertuples()}"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"hgnc_names = pbt.summary.get_names_by_namespace(graph, 'HGNC')\n",
"df_names = set(df['Gene.symbol'])\n",
"overlapping_hgnc_names = hgnc_names & df_names"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fix, (rax, lax) = plt.subplots(1, 2, figsize=(10, 3))\n",
"\n",
"lax.set_title('Distribution of Log-Fold-Changes in E-GEOD-5281')\n",
"lax.set_xlabel('Log-Fold-Change')\n",
"lax.set_ylabel('Frequency')\n",
"lax.hist(list(data.values()))\n",
"\n",
"rax.set_title('Gene Overlap')\n",
"venn2([hgnc_names, df_names], set_labels=[\"Alzheimer's Disease KA\", 'E-GEOD-5281'], ax=rax)\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Integration\n",
"\n",
"Finally, the differential gene expression data are ovelayed on the BEL graph with [pbt.integration.overlay_type_data](http://pybel-tools.readthedocs.io/en/latest/integration.html#pybel_tools.integration.overlay_type_data)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"key = 'weight'\n",
"cutoff = 0.3"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pbt.integration.overlay_type_data(graph, data, key, GENE, 'HGNC', overwrite=False, impute=0)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys([('Pathology', 'DO', \"Alzheimer's disease\"), ('Gene', 'dbSNP', 'rs3757536')])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"graph.edge[graph.nodes()[25]].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Concordance\n",
"\n",
"The effect of the cutoff is explored when calculating the concordance of the full knowledge assembly."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cutoffs = np.linspace(0, 3, 50)\n",
"\n",
"plt.plot(cutoffs, [\n",
" calculate_concordance(graph, key, cutoff=cutoff)\n",
" for cutoff in cutoffs\n",
"])\n",
"\n",
"plt.title('Effect of Cutoff on Concordance \\nfor {}'.format(graph))\n",
"plt.ylabel('Concordance')\n",
"plt.xlabel('Cutoff')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stratify by Subgraph\n",
"\n",
"The distribution of the concordance values across the stratified subgraphs is displayed below."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"concordance_df = pd.DataFrame.from_dict(\n",
" {\n",
" value.replace(' subgraph', ''): calculate_concordance_helper(subgraph, key, cutoff)\n",
" for value, subgraph in pbt.selection.get_subgraphs_by_annotation(graph, 'Subgraph').items()\n",
" }, \n",
" orient='index'\n",
")\n",
"\n",
"concordance_df.columns = 'Correct', 'Incorrect', 'Ambiguous', 'Unassigned'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" subgraphs that contain at least one correct or one incorrect relation are shown."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Correct | \n",
" Incorrect | \n",
" Ambiguous | \n",
" Unassigned | \n",
"
\n",
" \n",
" \n",
" \n",
" | Interleukin signaling | \n",
" 20 | \n",
" 41 | \n",
" 25 | \n",
" 285 | \n",
"
\n",
" \n",
" | Tumor necrosis factor | \n",
" 20 | \n",
" 29 | \n",
" 16 | \n",
" 214 | \n",
"
\n",
" \n",
" | miRNA | \n",
" 10 | \n",
" 16 | \n",
" 80 | \n",
" 167 | \n",
"
\n",
" \n",
" | Insulin signal transduction | \n",
" 10 | \n",
" 19 | \n",
" 17 | \n",
" 761 | \n",
"
\n",
" \n",
" | Nuclear factor Kappa beta | \n",
" 8 | \n",
" 29 | \n",
" 19 | \n",
" 60 | \n",
"
\n",
" \n",
" | Inflammatory response | \n",
" 8 | \n",
" 17 | \n",
" 23 | \n",
" 488 | \n",
"
\n",
" \n",
" | Chemokine signaling | \n",
" 6 | \n",
" 12 | \n",
" 4 | \n",
" 168 | \n",
"
\n",
" \n",
" | Neurotrophic | \n",
" 6 | \n",
" 7 | \n",
" 1 | \n",
" 56 | \n",
"
\n",
" \n",
" | Nerve growth factor | \n",
" 6 | \n",
" 10 | \n",
" 5 | \n",
" 171 | \n",
"
\n",
" \n",
" | Caspase | \n",
" 5 | \n",
" 18 | \n",
" 15 | \n",
" 156 | \n",
"
\n",
" \n",
" | DKK1 | \n",
" 5 | \n",
" 5 | \n",
" 8 | \n",
" 77 | \n",
"
\n",
" \n",
" | Albumin | \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 26 | \n",
"
\n",
" \n",
" | Acetylcholine signaling | \n",
" 3 | \n",
" 3 | \n",
" 2 | \n",
" 361 | \n",
"
\n",
" \n",
" | Bcl-2 | \n",
" 3 | \n",
" 11 | \n",
" 5 | \n",
" 77 | \n",
"
\n",
" \n",
" | Complement system | \n",
" 3 | \n",
" 3 | \n",
" 3 | \n",
" 97 | \n",
"
\n",
" \n",
" | Amyloidogenic | \n",
" 3 | \n",
" 27 | \n",
" 62 | \n",
" 1324 | \n",
"
\n",
" \n",
" | Endosomal lysosomal | \n",
" 2 | \n",
" 16 | \n",
" 8 | \n",
" 281 | \n",
"
\n",
" \n",
" | Chaperone | \n",
" 2 | \n",
" 9 | \n",
" 6 | \n",
" 63 | \n",
"
\n",
" \n",
" | Notch signaling | \n",
" 2 | \n",
" 2 | \n",
" 8 | \n",
" 117 | \n",
"
\n",
" \n",
" | Metabolism of steroid hormones | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/latex": [
"\\begin{center}{\\begin{tabular}{lrrrr}\n",
"\\toprule\n",
"{} & Correct & Incorrect & Ambiguous & Unassigned \\\\\n",
"\\midrule\n",
"Interleukin signaling & 20 & 41 & 25 & 285 \\\\\n",
"Tumor necrosis factor & 20 & 29 & 16 & 214 \\\\\n",
"miRNA & 10 & 16 & 80 & 167 \\\\\n",
"Insulin signal transduction & 10 & 19 & 17 & 761 \\\\\n",
"Nuclear factor Kappa beta & 8 & 29 & 19 & 60 \\\\\n",
"Inflammatory response & 8 & 17 & 23 & 488 \\\\\n",
"Chemokine signaling & 6 & 12 & 4 & 168 \\\\\n",
"Neurotrophic & 6 & 7 & 1 & 56 \\\\\n",
"Nerve growth factor & 6 & 10 & 5 & 171 \\\\\n",
"Caspase & 5 & 18 & 15 & 156 \\\\\n",
"DKK1 & 5 & 5 & 8 & 77 \\\\\n",
"Albumin & 3 & 1 & 0 & 26 \\\\\n",
"Acetylcholine signaling & 3 & 3 & 2 & 361 \\\\\n",
"Bcl-2 & 3 & 11 & 5 & 77 \\\\\n",
"Complement system & 3 & 3 & 3 & 97 \\\\\n",
"Amyloidogenic & 3 & 27 & 62 & 1324 \\\\\n",
"Endosomal lysosomal & 2 & 16 & 8 & 281 \\\\\n",
"Chaperone & 2 & 9 & 6 & 63 \\\\\n",
"Notch signaling & 2 & 2 & 8 & 117 \\\\\n",
"Metabolism of steroid hormones & 2 & 0 & 0 & 6 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"}\\end{center}"
],
"text/plain": [
" Correct Incorrect Ambiguous Unassigned\n",
"Interleukin signaling 20 41 25 285\n",
"Tumor necrosis factor 20 29 16 214\n",
"miRNA 10 16 80 167\n",
"Insulin signal transduction 10 19 17 761\n",
"Nuclear factor Kappa beta 8 29 19 60\n",
"Inflammatory response 8 17 23 488\n",
"Chemokine signaling 6 12 4 168\n",
"Neurotrophic 6 7 1 56\n",
"Nerve growth factor 6 10 5 171\n",
"Caspase 5 18 15 156\n",
"DKK1 5 5 8 77\n",
"Albumin 3 1 0 26\n",
"Acetylcholine signaling 3 3 2 361\n",
"Bcl-2 3 11 5 77\n",
"Complement system 3 3 3 97\n",
"Amyloidogenic 3 27 62 1324\n",
"Endosomal lysosomal 2 16 8 281\n",
"Chaperone 2 9 6 63\n",
"Notch signaling 2 2 8 117\n",
"Metabolism of steroid hormones 2 0 0 6"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"has_correct = concordance_df['Correct'] > 0\n",
"has_incorrect = concordance_df['Incorrect'] > 0\n",
"\n",
"concordance_df[has_correct | has_incorrect].sort_values('Correct', ascending=False).head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 20 highest concording subgraphs are shown. Even without statistical analysis, these data suggest that high concorance values are most likely due to random chance."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"concordance_df['Concordance'] = concordance_df['Correct'] / (concordance_df['Correct'] + concordance_df['Incorrect'])"
]
},
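{
"cell_type": "markdown",
"metadata": {},
"source": [
"Written out, the column computed above is the fraction of correct relations among those classified as either correct or incorrect; ambiguous and unassigned relations do not enter the ratio. The symbols $N_{\\text{correct}}$ and $N_{\\text{incorrect}}$ are introduced here only for illustration and stand for the `Correct` and `Incorrect` columns:\n",
"\n",
"$$\\text{concordance} = \\frac{N_{\\text{correct}}}{N_{\\text{correct}} + N_{\\text{incorrect}}}$$\n",
"\n",
"For example, the Albumin subgraph has 3 correct relations and 1 incorrect relation, giving $3 / (3 + 1) = 0.75$, while the Tumor necrosis factor subgraph gives $20 / (20 + 29) \\approx 0.408$. Subgraphs with no correct and no incorrect relations yield $0 / 0$, which pandas evaluates to `NaN`."
]
},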
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Correct | \n",
" Incorrect | \n",
" Ambiguous | \n",
" Unassigned | \n",
" Concordance | \n",
"
\n",
" \n",
" \n",
" \n",
" | Plasminogen activator | \n",
" 1 | \n",
" 0 | \n",
" 2 | \n",
" 97 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" | Metabolism of steroid hormones | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 6 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" | Cell cycle | \n",
" 1 | \n",
" 0 | \n",
" 7 | \n",
" 83 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" | Vitamin | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 41 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" | Albumin | \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 26 | \n",
" 0.750000 | \n",
"
\n",
" \n",
" | Reactive oxygen species | \n",
" 2 | \n",
" 1 | \n",
" 4 | \n",
" 184 | \n",
" 0.666667 | \n",
"
\n",
" \n",
" | Interferon signaling | \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
" 75 | \n",
" 0.666667 | \n",
"
\n",
" \n",
" | Complement system | \n",
" 3 | \n",
" 3 | \n",
" 3 | \n",
" 97 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" | Notch signaling | \n",
" 2 | \n",
" 2 | \n",
" 8 | \n",
" 117 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" | DKK1 | \n",
" 5 | \n",
" 5 | \n",
" 8 | \n",
" 77 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" | Acetylcholine signaling | \n",
" 3 | \n",
" 3 | \n",
" 2 | \n",
" 361 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" | Neurotrophic | \n",
" 6 | \n",
" 7 | \n",
" 1 | \n",
" 56 | \n",
" 0.461538 | \n",
"
\n",
" \n",
" | Tumor necrosis factor | \n",
" 20 | \n",
" 29 | \n",
" 16 | \n",
" 214 | \n",
" 0.408163 | \n",
"
\n",
" \n",
" | Binding and Uptake of Ligands by Scavenger Receptors | \n",
" 2 | \n",
" 3 | \n",
" 2 | \n",
" 72 | \n",
" 0.400000 | \n",
"
\n",
" \n",
" | miRNA | \n",
" 10 | \n",
" 16 | \n",
" 80 | \n",
" 167 | \n",
" 0.384615 | \n",
"
\n",
" \n",
" | Nerve growth factor | \n",
" 6 | \n",
" 10 | \n",
" 5 | \n",
" 171 | \n",
" 0.375000 | \n",
"
\n",
" \n",
" | Insulin signal transduction | \n",
" 10 | \n",
" 19 | \n",
" 17 | \n",
" 761 | \n",
" 0.344828 | \n",
"
\n",
" \n",
" | Prostaglandin | \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 82 | \n",
" 0.333333 | \n",
"
\n",
" \n",
" | Axonal transport | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
" 35 | \n",
" 0.333333 | \n",
"
\n",
" \n",
" | Chemokine signaling | \n",
" 6 | \n",
" 12 | \n",
" 4 | \n",
" 168 | \n",
" 0.333333 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/latex": [
"\\begin{center}{\\begin{tabular}{lrrrrr}\n",
"\\toprule\n",
"{} & Correct & Incorrect & Ambiguous & Unassigned & Concordance \\\\\n",
"\\midrule\n",
"Plasminogen activator & 1 & 0 & 2 & 97 & 1.000000 \\\\\n",
"Metabolism of steroid hormones & 2 & 0 & 0 & 6 & 1.000000 \\\\\n",
"Cell cycle & 1 & 0 & 7 & 83 & 1.000000 \\\\\n",
"Vitamin & 2 & 0 & 0 & 41 & 1.000000 \\\\\n",
"Albumin & 3 & 1 & 0 & 26 & 0.750000 \\\\\n",
"Reactive oxygen species & 2 & 1 & 4 & 184 & 0.666667 \\\\\n",
"Interferon signaling & 2 & 1 & 2 & 75 & 0.666667 \\\\\n",
"Complement system & 3 & 3 & 3 & 97 & 0.500000 \\\\\n",
"Notch signaling & 2 & 2 & 8 & 117 & 0.500000 \\\\\n",
"DKK1 & 5 & 5 & 8 & 77 & 0.500000 \\\\\n",
"Acetylcholine signaling & 3 & 3 & 2 & 361 & 0.500000 \\\\\n",
"Neurotrophic & 6 & 7 & 1 & 56 & 0.461538 \\\\\n",
"Tumor necrosis factor & 20 & 29 & 16 & 214 & 0.408163 \\\\\n",
"Binding and Uptake of Ligands by Scavenger Rece... & 2 & 3 & 2 & 72 & 0.400000 \\\\\n",
"miRNA & 10 & 16 & 80 & 167 & 0.384615 \\\\\n",
"Nerve growth factor & 6 & 10 & 5 & 171 & 0.375000 \\\\\n",
"Insulin signal transduction & 10 & 19 & 17 & 761 & 0.344828 \\\\\n",
"Prostaglandin & 1 & 2 & 1 & 82 & 0.333333 \\\\\n",
"Axonal transport & 1 & 2 & 4 & 35 & 0.333333 \\\\\n",
"Chemokine signaling & 6 & 12 & 4 & 168 & 0.333333 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"}\\end{center}"
],
"text/plain": [
" Correct Incorrect \\\n",
"Plasminogen activator 1 0 \n",
"Metabolism of steroid hormones 2 0 \n",
"Cell cycle 1 0 \n",
"Vitamin 2 0 \n",
"Albumin 3 1 \n",
"Reactive oxygen species 2 1 \n",
"Interferon signaling 2 1 \n",
"Complement system 3 3 \n",
"Notch signaling 2 2 \n",
"DKK1 5 5 \n",
"Acetylcholine signaling 3 3 \n",
"Neurotrophic 6 7 \n",
"Tumor necrosis factor 20 29 \n",
"Binding and Uptake of Ligands by Scavenger Rece... 2 3 \n",
"miRNA 10 16 \n",
"Nerve growth factor 6 10 \n",
"Insulin signal transduction 10 19 \n",
"Prostaglandin 1 2 \n",
"Axonal transport 1 2 \n",
"Chemokine signaling 6 12 \n",
"\n",
" Ambiguous Unassigned \\\n",
"Plasminogen activator 2 97 \n",
"Metabolism of steroid hormones 0 6 \n",
"Cell cycle 7 83 \n",
"Vitamin 0 41 \n",
"Albumin 0 26 \n",
"Reactive oxygen species 4 184 \n",
"Interferon signaling 2 75 \n",
"Complement system 3 97 \n",
"Notch signaling 8 117 \n",
"DKK1 8 77 \n",
"Acetylcholine signaling 2 361 \n",
"Neurotrophic 1 56 \n",
"Tumor necrosis factor 16 214 \n",
"Binding and Uptake of Ligands by Scavenger Rece... 2 72 \n",
"miRNA 80 167 \n",
"Nerve growth factor 5 171 \n",
"Insulin signal transduction 17 761 \n",
"Prostaglandin 1 82 \n",
"Axonal transport 4 35 \n",
"Chemokine signaling 4 168 \n",
"\n",
" Concordance \n",
"Plasminogen activator 1.000000 \n",
"Metabolism of steroid hormones 1.000000 \n",
"Cell cycle 1.000000 \n",
"Vitamin 1.000000 \n",
"Albumin 0.750000 \n",
"Reactive oxygen species 0.666667 \n",
"Interferon signaling 0.666667 \n",
"Complement system 0.500000 \n",
"Notch signaling 0.500000 \n",
"DKK1 0.500000 \n",
"Acetylcholine signaling 0.500000 \n",
"Neurotrophic 0.461538 \n",
"Tumor necrosis factor 0.408163 \n",
"Binding and Uptake of Ligands by Scavenger Rece... 0.400000 \n",
"miRNA 0.384615 \n",
"Nerve growth factor 0.375000 \n",
"Insulin signal transduction 0.344828 \n",
"Prostaglandin 0.333333 \n",
"Axonal transport 0.333333 \n",
"Chemokine signaling 0.333333 "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"concordance_df.sort_values('Concordance', ascending=False).head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While these calculations were presented in a dataframe, they can be directly acquired with `calculate_concordance_by_annotation`."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 427 ms, sys: 30.1 ms, total: 458 ms\n",
"Wall time: 466 ms\n"
]
}
],
"source": [
"%%time\n",
"results = calculate_concordance_by_annotation(graph, 'Subgraph', key, cutoff)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distribution of concordance values over all subgraphs is shown."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.distplot([\n",
" result\n",
" for result in results.values() \n",
" if result != -1\n",
"])\n",
"\n",
"plt.title('Distribution of concordance values at $C={}$'.format(cutoff))\n",
"plt.xlabel('Concordance')\n",
"plt.ylabel('Frequency')\n",
"plt.xlim([0, 1])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions\n",
"\n",
"Varying the threshold from just zero to just above zero to much more stringent reveals varying results. High-dimensional data visualization techniques will need to be used to identify the effect of the threshold, and eventually the stability of concordance values through randomized permutation tests to assess the reliability of this method.\n",
"\n",
"Without further methodological development, this method seems unlikely to make useful assistance in analysis."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}