{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "How_to_use_the_Reactome_BQ_dataset.ipynb", "provenance": [], "collapsed_sections": [], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "qgwv4lK5V8Zc" }, "source": [ "# How to use the Reactome BigQuery dataset\n", "Check out other notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n", "\n", "- **Title:** How to use the Reactome BigQuery dataset\n", "- **Author:** John Phan\n", "- **Created:** 2021-09-13\n", "- **Purpose:** Demonstrate basic usage of the Reactome BigQuery dataset\n", "- **URL:** https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_use_the_Reactome_BQ_dataset.ipynb\n", "\n", "This notebook demonstrates basic usage of the Reactome BigQuery dataset. Analysis of this dataset can provide a powerful tool for identifying pathways related to cancer biomarkers. \n", "\n", "The Reactome is an open-source, manually curated, and peer-reviewed pathway database. More information can be found here: https://reactome.org/. \n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "1zVfEL8i8__O" }, "source": [ "# Initialize Notebook Environment\n", "\n", "Before running the analysis, we need to load dependencies, authenticate to BigQuery, and customize notebook parameters." ] }, { "cell_type": "markdown", "metadata": { "id": "ujUW1wF23d93" }, "source": [ "## Import Dependencies" ] }, { "cell_type": "code", "metadata": { "id": "LNlADvO-gken" }, "source": [ "# GCP Libraries\n", "from google.cloud import bigquery\n", "from google.colab import auth\n", "\n", "# Data Analytics\n", "import numpy as np\n", "from scipy import stats" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "LBHnu8bZDEcE" }, "source": [ "## Authenticate\n", "\n", "Before using BigQuery, we need to get authorization for access to BigQuery and the Google Cloud. For more information see ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods can be found [here](https://googleapis.dev/python/google-api-core/latest/auth.html)." ] }, { "cell_type": "code", "metadata": { "id": "M6BEFHj2dd5a" }, "source": [ "# if you're using Google Colab, authenticate to gcloud with the following\n", "auth.authenticate_user()\n", "\n", "# alternatively, use the gcloud SDK\n", "#!gcloud auth application-default login" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "RWuECMsRptWJ" }, "source": [ "## Parameters\n", "\n", "Customize the following parameters based on your notebook, execution environment, or project." ] }, { "cell_type": "code", "metadata": { "id": "veP6gPatu2iW" }, "source": [ "# set the google project that will be billed for this notebook's computations\n", "google_project = 'google-project' ## change me" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "F6zTODbDq3g9" }, "source": [ "## BigQuery Client\n", "\n", "Create the BigQuery client." ] }, { "cell_type": "code", "metadata": { "id": "HNkfSXedwzbC" }, "source": [ "# Create a client to access the data within BigQuery\n", "client = bigquery.Client(google_project)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "7kVZ88dU85Vs" }, "source": [ "# Identify all Reactome Pathways Related to Genes\n", "\n", "We can join tables from the Reactome BigQuery dataset to identify all pathways related to our genes of interest. We first have to map the gene names to Uniprot IDs in the Reactome \"physical entities\" table. These can then be mapped to Reactome pathways. We further filter the physical entity to pathway evidence codes to retain only interactions that have \"Traceable Author Statements\" (TAS) rather than just \"Inferred from Electronic Annotation\" (IEA) to avoid evidence that have not been manually curated. \n", "\n", "We use the following genes to identify related pathways. These genes were identified in an ovarian cancer chemo-response study by [Bosquet et al](https://molecular-cancer.biomedcentral.com/articles/10.1186/s12943-016-0548-9). " ] }, { "cell_type": "code", "metadata": { "id": "4QDl8QK9dZJf" }, "source": [ "# set parameters for query\n", "genes = \"'RHOT1','MYO7A','ZBTB10','MATK','ST18','RPS23','GCNT1','DROSHA','NUAK1','CCPG1',\\\n", "'PDGFD','KLRAP1','MTAP','RNF13','THBS1','MLX','FAP','TIMP3','PRSS1','SLC7A11',\\\n", "'OLFML3','RPS20','MCM5','POLE','STEAP4','LRRC8D','WBP1L','ENTPD5','SYNE1','DPT',\\\n", "'COPZ2','TRIO','PDPR'\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ZwJGDXql9j6P" }, "source": [ "# run query and put results in data frame\n", "pathways = client.query((\"\"\"\n", " SELECT\n", " DISTINCT pathway.*\n", "\n", " FROM\n", " `isb-cgc-bq.reactome_versioned.pathway_v77` as pathway\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` as pe2pathway\n", " -- link pathways to physical entities via intermediate table\n", " ON pathway.stable_id = pe2pathway.pathway_stable_id\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", " -- link pathways to physical entities\n", " ON pe2pathway.pe_stable_id = pe.stable_id\n", "\n", " WHERE\n", " -- filter by stronger evidence: \"Traceable Author Statement\" \n", " pe2pathway.evidence_code = 'TAS'\n", "\n", " -- filter by pathways that are related to genes in list\n", " AND pe.name IN ({genes}) \n", "\n", " ORDER BY pathway.name ASC\n", "\"\"\").format(\n", " genes=genes\n", ")).result().to_dataframe()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 419 }, "id": "DmkVEYCHfcWK", "outputId": "dc724070-e5f7-40f5-fce8-b3bba4aec5a4" }, "source": [ "# Display the data frame\n", "pathways" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stable_idurlnamespecieslowest_level
0R-HSA-176187https://reactome.org/PathwayBrowser/#/R-HSA-17...Activation of ATR in response to replication s...Homo sapiensTrue
1R-HSA-72662https://reactome.org/PathwayBrowser/#/R-HSA-72662Activation of the mRNA upon binding of the cap...Homo sapiensFalse
2R-HSA-68962https://reactome.org/PathwayBrowser/#/R-HSA-68962Activation of the pre-replicative complexHomo sapiensTrue
3R-HSA-352230https://reactome.org/PathwayBrowser/#/R-HSA-35...Amino acid transport across the plasma membraneHomo sapiensTrue
4R-HSA-446203https://reactome.org/PathwayBrowser/#/R-HSA-44...Asparagine N-linked glycosylationHomo sapiensFalse
..................
148R-HSA-192823https://reactome.org/PathwayBrowser/#/R-HSA-19...Viral mRNA TranslationHomo sapiensTrue
149R-HSA-2187338https://reactome.org/PathwayBrowser/#/R-HSA-21...Visual phototransductionHomo sapiensFalse
150R-HSA-193704https://reactome.org/PathwayBrowser/#/R-HSA-19...p75 NTR receptor-mediated signallingHomo sapiensFalse
151R-HSA-72312https://reactome.org/PathwayBrowser/#/R-HSA-72312rRNA processingHomo sapiensFalse
152R-HSA-8868773https://reactome.org/PathwayBrowser/#/R-HSA-88...rRNA processing in the nucleus and cytosolHomo sapiensFalse
\n", "

153 rows × 5 columns

\n", "
" ], "text/plain": [ " stable_id ... lowest_level\n", "0 R-HSA-176187 ... True\n", "1 R-HSA-72662 ... False\n", "2 R-HSA-68962 ... True\n", "3 R-HSA-352230 ... True\n", "4 R-HSA-446203 ... False\n", ".. ... ... ...\n", "148 R-HSA-192823 ... True\n", "149 R-HSA-2187338 ... False\n", "150 R-HSA-193704 ... False\n", "151 R-HSA-72312 ... False\n", "152 R-HSA-8868773 ... False\n", "\n", "[153 rows x 5 columns]" ] }, "metadata": {}, "execution_count": 7 } ] }, { "cell_type": "markdown", "metadata": { "id": "4IT_WO6PO0bH" }, "source": [ "If we're only interested in the lowest level pathways, i.e., pathways that are not parents of other pathways in the hierarchy, we can filter by the \"lowest_level\" field in the pathways table." ] }, { "cell_type": "code", "metadata": { "id": "IQcGozM6OzpW" }, "source": [ "# run query and put results in data frame\n", "lowest_pathways = client.query((\"\"\"\n", " SELECT\n", " DISTINCT pathway.*\n", "\n", " FROM\n", " `isb-cgc-bq.reactome_versioned.pathway_v77` as pathway\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` as pe2pathway\n", " -- link pathways to physical entities via intermediate table\n", " ON pathway.stable_id = pe2pathway.pathway_stable_id\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", " -- link pathways to physical entities\n", " ON pe2pathway.pe_stable_id = pe.stable_id\n", "\n", " WHERE\n", " -- filter by stronger evidence: \"Traceable Author Statement\" \n", " pe2pathway.evidence_code = 'TAS'\n", "\n", " -- filter by pathways that are related to genes in list\n", " AND pe.name IN ({genes})\n", "\n", " -- filter to include just lowest level pathways\n", " AND pathway.lowest_level = TRUE\n", "\n", " ORDER BY pathway.name ASC\n", "\"\"\").format(\n", " genes=genes\n", ")).result().to_dataframe()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "q5bZg8w6Pn7Q", "outputId": "8096e367-7452-4814-d783-90df5cda9632" }, "source": [ "# display data frame\n", "lowest_pathways" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stable_idurlnamespecieslowest_level
0R-HSA-176187https://reactome.org/PathwayBrowser/#/R-HSA-17...Activation of ATR in response to replication s...Homo sapiensTrue
1R-HSA-68962https://reactome.org/PathwayBrowser/#/R-HSA-68962Activation of the pre-replicative complexHomo sapiensTrue
2R-HSA-352230https://reactome.org/PathwayBrowser/#/R-HSA-35...Amino acid transport across the plasma membraneHomo sapiensTrue
3R-HSA-210991https://reactome.org/PathwayBrowser/#/R-HSA-21...Basigin interactionsHomo sapiensTrue
4R-HSA-9013148https://reactome.org/PathwayBrowser/#/R-HSA-90...CDC42 GTPase cycleHomo sapiensTrue
5R-HSA-6811434https://reactome.org/PathwayBrowser/#/R-HSA-68...COPI-dependent Golgi-to-ER retrograde trafficHomo sapiensTrue
6R-HSA-6807878https://reactome.org/PathwayBrowser/#/R-HSA-68...COPI-mediated anterograde transportHomo sapiensTrue
7R-HSA-163765https://reactome.org/PathwayBrowser/#/R-HSA-16...ChREBP activates metabolic gene expressionHomo sapiensTrue
8R-HSA-418885https://reactome.org/PathwayBrowser/#/R-HSA-41...DCC mediated attractive signalingHomo sapiensTrue
9R-HSA-68952https://reactome.org/PathwayBrowser/#/R-HSA-68952DNA replication initiationHomo sapiensTrue
10R-HSA-5696400https://reactome.org/PathwayBrowser/#/R-HSA-56...Dual Incision in GG-NERHomo sapiensTrue
11R-HSA-6782135https://reactome.org/PathwayBrowser/#/R-HSA-67...Dual incision in TC-NERHomo sapiensTrue
12R-HSA-72689https://reactome.org/PathwayBrowser/#/R-HSA-72689Formation of a pool of free 40S subunitsHomo sapiensTrue
13R-HSA-72695https://reactome.org/PathwayBrowser/#/R-HSA-72695Formation of the ternary complex, and subseque...Homo sapiensTrue
14R-HSA-416482https://reactome.org/PathwayBrowser/#/R-HSA-41...G alpha (12/13) signalling eventsHomo sapiensTrue
15R-HSA-72706https://reactome.org/PathwayBrowser/#/R-HSA-72706GTP hydrolysis and joining of the 60S ribosoma...Homo sapiensTrue
16R-HSA-5696397https://reactome.org/PathwayBrowser/#/R-HSA-56...Gap-filling DNA repair synthesis and ligation ...Homo sapiensTrue
17R-HSA-6782210https://reactome.org/PathwayBrowser/#/R-HSA-67...Gap-filling DNA repair synthesis and ligation ...Homo sapiensTrue
18R-HSA-8950505https://reactome.org/PathwayBrowser/#/R-HSA-89...Gene and protein expression by JAK-STAT signal...Homo sapiensTrue
19R-HSA-216083https://reactome.org/PathwayBrowser/#/R-HSA-21...Integrin cell surface interactionsHomo sapiensTrue
20R-HSA-156827https://reactome.org/PathwayBrowser/#/R-HSA-15...L13a-mediated translational silencing of Cerul...Homo sapiensTrue
21R-HSA-6791226https://reactome.org/PathwayBrowser/#/R-HSA-67...Major pathway of rRNA processing in the nucleo...Homo sapiensTrue
22R-HSA-1237112https://reactome.org/PathwayBrowser/#/R-HSA-12...Methionine salvage pathwayHomo sapiensTrue
23R-HSA-203927https://reactome.org/PathwayBrowser/#/R-HSA-20...MicroRNA (miRNA) biogenesisHomo sapiensTrue
24R-HSA-5223345https://reactome.org/PathwayBrowser/#/R-HSA-52...Miscellaneous transport and binding eventsHomo sapiensTrue
25R-HSA-193648https://reactome.org/PathwayBrowser/#/R-HSA-19...NRAGE signals death through JNKHomo sapiensTrue
26R-HSA-975957https://reactome.org/PathwayBrowser/#/R-HSA-97...Nonsense Mediated Decay (NMD) enhanced by the ...Homo sapiensTrue
27R-HSA-975956https://reactome.org/PathwayBrowser/#/R-HSA-97...Nonsense Mediated Decay (NMD) independent of t...Homo sapiensTrue
28R-HSA-5173214https://reactome.org/PathwayBrowser/#/R-HSA-51...O-glycosylation of TSR domain-containing proteinsHomo sapiensTrue
29R-HSA-68949https://reactome.org/PathwayBrowser/#/R-HSA-68949Orc1 removal from chromatinHomo sapiensTrue
30R-HSA-5651801https://reactome.org/PathwayBrowser/#/R-HSA-56...PCNA-Dependent Long Patch Base Excision RepairHomo sapiensTrue
31R-HSA-8850843https://reactome.org/PathwayBrowser/#/R-HSA-88...Phosphate bond hydrolysis by NTPDase proteinsHomo sapiensTrue
32R-HSA-114608https://reactome.org/PathwayBrowser/#/R-HSA-11...Platelet degranulationHomo sapiensTrue
33R-HSA-9660826https://reactome.org/PathwayBrowser/#/R-HSA-96...Purinergic signaling in leishmaniasis infectionHomo sapiensTrue
34R-HSA-9013149https://reactome.org/PathwayBrowser/#/R-HSA-90...RAC1 GTPase cycleHomo sapiensTrue
35R-HSA-9013404https://reactome.org/PathwayBrowser/#/R-HSA-90...RAC2 GTPase cycleHomo sapiensTrue
36R-HSA-9013423https://reactome.org/PathwayBrowser/#/R-HSA-90...RAC3 GTPase cycleHomo sapiensTrue
37R-HSA-8980692https://reactome.org/PathwayBrowser/#/R-HSA-89...RHOA GTPase cycleHomo sapiensTrue
38R-HSA-9013408https://reactome.org/PathwayBrowser/#/R-HSA-90...RHOG GTPase cycleHomo sapiensTrue
39R-HSA-9013409https://reactome.org/PathwayBrowser/#/R-HSA-90...RHOJ GTPase cycleHomo sapiensTrue
40R-HSA-9013425https://reactome.org/PathwayBrowser/#/R-HSA-90...RHOT1 GTPase cycleHomo sapiensTrue
41R-HSA-8936459https://reactome.org/PathwayBrowser/#/R-HSA-89...RUNX1 regulates genes involved in megakaryocyt...Homo sapiensTrue
42R-HSA-110314https://reactome.org/PathwayBrowser/#/R-HSA-11...Recognition of DNA damage by PCNA-containing r...Homo sapiensTrue
43R-HSA-6804756https://reactome.org/PathwayBrowser/#/R-HSA-68...Regulation of TP53 Activity through Phosphoryl...Homo sapiensTrue
44R-HSA-204174https://reactome.org/PathwayBrowser/#/R-HSA-20...Regulation of pyruvate dehydrogenase (PDH) com...Homo sapiensTrue
45R-HSA-72702https://reactome.org/PathwayBrowser/#/R-HSA-72702Ribosomal scanning and start codon recognitionHomo sapiensTrue
46R-HSA-1799339https://reactome.org/PathwayBrowser/#/R-HSA-17...SRP-dependent cotranslational protein targetin...Homo sapiensTrue
47R-HSA-5656169https://reactome.org/PathwayBrowser/#/R-HSA-56...Termination of translesion DNA synthesisHomo sapiensTrue
48R-HSA-72649https://reactome.org/PathwayBrowser/#/R-HSA-72649Translation initiation complex formationHomo sapiensTrue
49R-HSA-5689880https://reactome.org/PathwayBrowser/#/R-HSA-56...Ub-specific processing proteasesHomo sapiensTrue
50R-HSA-192823https://reactome.org/PathwayBrowser/#/R-HSA-19...Viral mRNA TranslationHomo sapiensTrue
\n", "
" ], "text/plain": [ " stable_id ... lowest_level\n", "0 R-HSA-176187 ... True\n", "1 R-HSA-68962 ... True\n", "2 R-HSA-352230 ... True\n", "3 R-HSA-210991 ... True\n", "4 R-HSA-9013148 ... True\n", "5 R-HSA-6811434 ... True\n", "6 R-HSA-6807878 ... True\n", "7 R-HSA-163765 ... True\n", "8 R-HSA-418885 ... True\n", "9 R-HSA-68952 ... True\n", "10 R-HSA-5696400 ... True\n", "11 R-HSA-6782135 ... True\n", "12 R-HSA-72689 ... True\n", "13 R-HSA-72695 ... True\n", "14 R-HSA-416482 ... True\n", "15 R-HSA-72706 ... True\n", "16 R-HSA-5696397 ... True\n", "17 R-HSA-6782210 ... True\n", "18 R-HSA-8950505 ... True\n", "19 R-HSA-216083 ... True\n", "20 R-HSA-156827 ... True\n", "21 R-HSA-6791226 ... True\n", "22 R-HSA-1237112 ... True\n", "23 R-HSA-203927 ... True\n", "24 R-HSA-5223345 ... True\n", "25 R-HSA-193648 ... True\n", "26 R-HSA-975957 ... True\n", "27 R-HSA-975956 ... True\n", "28 R-HSA-5173214 ... True\n", "29 R-HSA-68949 ... True\n", "30 R-HSA-5651801 ... True\n", "31 R-HSA-8850843 ... True\n", "32 R-HSA-114608 ... True\n", "33 R-HSA-9660826 ... True\n", "34 R-HSA-9013149 ... True\n", "35 R-HSA-9013404 ... True\n", "36 R-HSA-9013423 ... True\n", "37 R-HSA-8980692 ... True\n", "38 R-HSA-9013408 ... True\n", "39 R-HSA-9013409 ... True\n", "40 R-HSA-9013425 ... True\n", "41 R-HSA-8936459 ... True\n", "42 R-HSA-110314 ... True\n", "43 R-HSA-6804756 ... True\n", "44 R-HSA-204174 ... True\n", "45 R-HSA-72702 ... True\n", "46 R-HSA-1799339 ... True\n", "47 R-HSA-5656169 ... True\n", "48 R-HSA-72649 ... True\n", "49 R-HSA-5689880 ... True\n", "50 R-HSA-192823 ... True\n", "\n", "[51 rows x 5 columns]" ] }, "metadata": {}, "execution_count": 9 } ] }, { "cell_type": "markdown", "metadata": { "id": "ybqSP2aVNE99" }, "source": [ "# Pathway Enrichment Analysis\n", "We can identify pathways that are \"enriched\" with the genes of interest. In other words, we can answer the question: given a set of interesting genes, which pathways contain those genes at a frequency higher than random chance? By calculating the probability that a number of target genes are contained in each pathway, we can identify pathways most likely to be related. \n", "\n", "To do this, we can use a **chi-squared** test to determine if there is a statistically significant difference between the expected frequency of genes in a pathway compared to the observed frequency. \n" ] }, { "cell_type": "code", "metadata": { "id": "sKLy2sxQNIu6" }, "source": [ "# set query parameters\n", "lowest_level = True # only show pathways at the lowest level" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "jDdXWbOHa32V" }, "source": [ "## Construct SQL Query\n", "\n", "A single query can be used to calculate the chi-squared statistic for all pathways. This query is rather lengthy, but can be broken up into a series of named sub-queries. We step through each query below.\n", "\n", "First, we write a query that simply gets a list of all genes in the Reactome physical entity table: " ] }, { "cell_type": "code", "metadata": { "id": "r3NeGvfMbgBp" }, "source": [ "gene_list_query = \"\"\"\n", " -- Table that contains a list of all distinct genes that map to Reactome\n", " -- physical entities \n", " SELECT\n", " DISTINCT pe.uniprot_id\n", "\n", " FROM\n", " `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", "\n", " WHERE\n", " -- filter by pathways that are related to genes in list\n", " pe.name IN ({genes})\n", "\"\"\".format(\n", " genes=genes\n", ").strip(\"\\n\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wTVNQFJjgQ1y", "outputId": "b089723a-6c24-4185-9c43-8e34d02f6a55" }, "source": [ "print(gene_list_query)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " -- Table that contains a list of all distinct genes that map to Reactome\n", " -- physical entities \n", " SELECT\n", " DISTINCT pe.uniprot_id\n", "\n", " FROM\n", " `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", "\n", " WHERE\n", " -- filter by pathways that are related to genes in list\n", " pe.name IN ('RHOT1','MYO7A','ZBTB10','MATK','ST18','RPS23','GCNT1','DROSHA','NUAK1','CCPG1','PDGFD','KLRAP1','MTAP','RNF13','THBS1','MLX','FAP','TIMP3','PRSS1','SLC7A11','OLFML3','RPS20','MCM5','POLE','STEAP4','LRRC8D','WBP1L','ENTPD5','SYNE1','DPT','COPZ2','TRIO','PDPR')\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "bY4QCtDfgwBq" }, "source": [ "We then create a query that counts all targets associated with each Reactome pathway. This query depends on the previous `gene_list_query` sub-query. " ] }, { "cell_type": "code", "metadata": { "id": "H_3xAQdjhJ9s" }, "source": [ "gene_pp_query = \"\"\"\n", " -- Table that maps pathways to the total number of interesting genes within\n", " -- that pathway\n", " SELECT\n", " COUNT(DISTINCT gene_list_query.uniprot_id) as num_genes,\n", " pathway.stable_id,\n", " pathway.name\n", "\n", " FROM\n", " gene_list_query\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", " -- filter for interactions with genes that match a reactome\n", " -- physical entity\n", " ON gene_list_query.uniprot_id = pe.uniprot_id\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n", " -- link physical entities to pathways via intermediate table\n", " ON pe.stable_id = pe2pathway.pe_stable_id\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n", " -- link physical entities to pathways\n", " ON pe2pathway.pathway_stable_id = pathway.stable_id\n", "\n", " WHERE\n", " -- filter by stronger evidence: \"Traceable Author Statement\" \n", " pe2pathway.evidence_code = 'TAS'\n", "\n", " GROUP BY pathway.stable_id, pathway.name\n", " ORDER BY num_genes DESC\n", "\"\"\".strip(\"\\n\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "v29oHYEukrzy", "outputId": "975b1f58-969e-4676-d5b7-45ab0b78694c" }, "source": [ "print(gene_pp_query)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " -- Table that maps pathways to the total number of interesting genes within\n", " -- that pathway\n", " SELECT\n", " COUNT(DISTINCT gene_list_query.uniprot_id) as num_genes,\n", " pathway.stable_id,\n", " pathway.name\n", "\n", " FROM\n", " gene_list_query\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", " -- filter for interactions with genes that match a reactome\n", " -- physical entity\n", " ON gene_list_query.uniprot_id = pe.uniprot_id\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n", " -- link physical entities to pathways via intermediate table\n", " ON pe.stable_id = pe2pathway.pe_stable_id\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n", " -- link physical entities to pathways\n", " ON pe2pathway.pathway_stable_id = pathway.stable_id\n", "\n", " WHERE\n", " -- filter by stronger evidence: \"Traceable Author Statement\" \n", " pe2pathway.evidence_code = 'TAS'\n", "\n", " GROUP BY pathway.stable_id, pathway.name\n", " ORDER BY num_genes DESC\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "_XXiAJZUlBxC" }, "source": [ "Now we construct the same queries for genes that are NOT in the interesting gene list. These are prefixed with \"not_gene\". This query depends on the previous `gene_list_query` sub-query. " ] }, { "cell_type": "code", "metadata": { "id": "1KyR-_WqlVVn" }, "source": [ "not_gene_list_query = \"\"\"\n", " -- Table that contains a list of all genes that are NOT in the interest list\n", " -- This query depends on the previous \"gene_list_query\" sub-query.\n", " SELECT\n", " DISTINCT pe.uniprot_id\n", "\n", " FROM `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", "\n", " WHERE\n", " pe.uniprot_id NOT IN (\n", " SELECT uniprot_id FROM gene_list_query\n", " )\n", "\"\"\".strip(\"\\n\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "92MaTzsAn3es", "outputId": "c50ae34a-1cbf-496e-86b4-ab804bc9c581" }, "source": [ "print(not_gene_list_query)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " -- Table that contains a list of all genes that are NOT in the interest list\n", " -- This query depends on the previous \"gene_list_query\" sub-query.\n", " SELECT\n", " DISTINCT pe.uniprot_id\n", "\n", " FROM `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", "\n", " WHERE\n", " pe.uniprot_id NOT IN (\n", " SELECT uniprot_id FROM gene_list_query\n", " )\n" ] } ] }, { "cell_type": "code", "metadata": { "id": "mI1SxkGtn-ow" }, "source": [ "not_gene_pp_query = \"\"\"\n", " -- Table that maps pathways to the number of proteins that are NOT drug\n", " -- targets in that pathway.\n", " SELECT\n", " COUNT(DISTINCT not_gene_list_query.uniprot_id) AS num_not_genes,\n", " pathway.stable_id,\n", " pathway.name\n", "\n", " FROM not_gene_list_query\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", " ON not_gene_list_query.uniprot_id = pe.uniprot_id\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n", " ON pe.stable_id = pe2pathway.pe_stable_id\n", " \n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n", " ON pe2pathway.pathway_stable_id = pathway.stable_id\n", "\n", " WHERE\n", " -- filter by stronger evidence: \"Traceable Author Statement\" \n", " pe2pathway.evidence_code = 'TAS'\n", "\n", " GROUP BY pathway.stable_id, pathway.name\n", " ORDER BY num_not_genes DESC\n", "\"\"\".strip(\"\\n\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "TgY33Z3ukByC", "outputId": "e9143dce-28de-4f43-ce7a-032c79757486" }, "source": [ "print(not_gene_pp_query)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " -- Table that maps pathways to the number of proteins that are NOT drug\n", " -- targets in that pathway.\n", " SELECT\n", " COUNT(DISTINCT not_gene_list_query.uniprot_id) AS num_not_genes,\n", " pathway.stable_id,\n", " pathway.name\n", "\n", " FROM not_gene_list_query\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", " ON not_gene_list_query.uniprot_id = pe.uniprot_id\n", "\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n", " ON pe.stable_id = pe2pathway.pe_stable_id\n", " \n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n", " ON pe2pathway.pathway_stable_id = pathway.stable_id\n", "\n", " WHERE\n", " -- filter by stronger evidence: \"Traceable Author Statement\" \n", " pe2pathway.evidence_code = 'TAS'\n", "\n", " GROUP BY pathway.stable_id, pathway.name\n", " ORDER BY num_not_genes DESC\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "Z2eN5_UfVgcX" }, "source": [ "For convenience, we create a sub-query to get the counts of the number of genes that are in the list and the number of genes that are not in the list." ] }, { "cell_type": "code", "metadata": { "id": "VAYDIHsdWLFA" }, "source": [ "gene_count_query = \"\"\"\n", " -- Table that contains the counts of # of genes that are/are not targets\n", " SELECT\n", " gene_count,\n", " not_gene_count,\n", " gene_count + not_gene_count AS total_count\n", "\n", " FROM \n", " (SELECT COUNT(*) AS gene_count FROM gene_list_query),\n", " (SELECT COUNT(*) AS not_gene_count FROM not_gene_list_query)\n", "\"\"\".strip(\"\\n\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "qrt8DcUraJm3" }, "source": [ "Now we can create more interesting queries for the contingency matrices that contain the observed and expected values. \n", "\n", "First, the observed contingency table counts for each pathway:" ] }, { "cell_type": "code", "metadata": { "id": "8Dc-wMSZakad" }, "source": [ "observed_query = \"\"\"\n", " -- Table with observed values per pathway in the contingency matrix\n", " SELECT\n", " gene_pp_query.num_genes AS in_gene_in_pathway,\n", " not_gene_pp_query.num_not_genes AS not_gene_in_pathway,\n", " gene_count_query.gene_count - gene_pp_query.num_genes AS in_gene_not_pathway,\n", " gene_count_query.not_gene_count - not_gene_pp_query.num_not_genes AS not_gene_not_pathway,\n", " gene_pp_query.stable_id,\n", " gene_pp_query.name\n", "\n", " FROM \n", " gene_pp_query,\n", " gene_count_query\n", "\n", " INNER JOIN not_gene_pp_query\n", " ON gene_pp_query.stable_id = not_gene_pp_query.stable_id\n", "\"\"\".strip(\"\\n\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "STbl3Toxcq7B" }, "source": [ "Then the observed row and column sums of the contingency table:" ] }, { "cell_type": "code", "metadata": { "id": "4kXNQ4SRcmPK" }, "source": [ "sum_query = \"\"\"\n", " -- Table with summed observed values per pathway in the contingency matrix\n", " SELECT\n", " observed_query.in_gene_in_pathway + observed_query.not_gene_in_pathway AS pathway_total,\n", " observed_query.in_gene_not_pathway + observed_query.not_gene_not_pathway AS not_pathway_total,\n", " observed_query.in_gene_in_pathway + observed_query.in_gene_not_pathway AS gene_total,\n", " observed_query.not_gene_in_pathway + observed_query.not_gene_not_pathway AS not_gene_total,\n", " observed_query.stable_id,\n", " observed_query.name\n", "\n", " FROM\n", " observed_query\n", "\"\"\".strip(\"\\n\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "6HiQlxAudQEk" }, "source": [ "And the expected contingency table values for each pathway:" ] }, { "cell_type": "code", "metadata": { "id": "V0Rj098QdWe6" }, "source": [ "expected_query = \"\"\" \n", " -- Table with the expected values per pathway in the contingency matrix\n", " SELECT \n", " sum_query.gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_in_gene_in_pathway,\n", " sum_query.not_gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_not_gene_in_pathway,\n", " sum_query.gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_in_gene_not_pathway,\n", " sum_query.not_gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_not_gene_not_pathway,\n", " sum_query.stable_id,\n", " sum_query.name\n", "\n", " FROM \n", " sum_query, gene_count_query\n", "\"\"\".strip(\"\\n\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "wC57CDA6d_4H" }, "source": [ "Finally, we can calculate the chi-squared statistic for each pathway:" ] }, { "cell_type": "code", "metadata": { "id": "kxlrSyiTd_Ia" }, "source": [ "chi_squared_query = \"\"\"\n", " -- Table with the chi-squared statistic for each pathway\n", " SELECT\n", " -- Chi squared statistic with Yates' correction\n", " POW(ABS(observed_query.in_gene_in_pathway - expected_query.exp_in_gene_in_pathway) - 0.5, 2) / expected_query.exp_in_gene_in_pathway \n", " + POW(ABS(observed_query.not_gene_in_pathway - expected_query.exp_not_gene_in_pathway) - 0.5, 2) / expected_query.exp_not_gene_in_pathway\n", " + POW(ABS(observed_query.in_gene_not_pathway - expected_query.exp_in_gene_not_pathway) - 0.5, 2) / expected_query.exp_in_gene_not_pathway\n", " + POW(ABS(observed_query.not_gene_not_pathway - expected_query.exp_not_gene_not_pathway) - 0.5, 2) / expected_query.exp_not_gene_not_pathway\n", " AS chi_squared_stat,\n", " observed_query.stable_id,\n", " observed_query.name\n", "\n", " FROM observed_query\n", "\n", " INNER JOIN expected_query\n", " ON observed_query.stable_id = expected_query.stable_id\n", "\"\"\".strip(\"\\n\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "lTRfNbLKuarY" }, "source": [ "The final piece of the query optionally adds a filter that removes all pathways that are not at the lowest level of the hierarchy. This helps remove non-specific \"pathways\" such as the generic \"Disease\" pathway." ] }, { "cell_type": "code", "metadata": { "id": "IBpHKWxjvCSx" }, "source": [ "lowest_level_filter = \"\"\"\n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n", " ON chi_squared_query.stable_id = pathway.stable_id\n", "\n", " WHERE pathway.lowest_level = TRUE\n", "\"\"\".strip(\"\\n\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "V_Qx1J2IvrBT", "outputId": "04ee2a9a-a1df-416f-fef6-83cb0bc23671" }, "source": [ "print(lowest_level_filter)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n", " ON chi_squared_query.stable_id = pathway.stable_id\n", "\n", " WHERE pathway.lowest_level = TRUE\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "QsJQyREafSZE" }, "source": [ "Now we can combine all sub-queries to create the final query:" ] }, { "cell_type": "code", "metadata": { "id": "dOXXXAmkfcnb" }, "source": [ "final_query = \"\"\"\n", " WITH\n", " gene_list_query AS (\n", " {gene_list_query}\n", " ),\n", "\n", " gene_pp_query AS (\n", " {gene_pp_query}\n", " ),\n", "\n", " not_gene_list_query AS (\n", " {not_gene_list_query}\n", " ),\n", "\n", " not_gene_pp_query AS (\n", " {not_gene_pp_query}\n", " ),\n", "\n", " gene_count_query AS (\n", " {gene_count_query}\n", " ),\n", "\n", " observed_query AS (\n", " {observed_query}\n", " ),\n", "\n", " sum_query AS (\n", " {sum_query}\n", " ),\n", "\n", " expected_query AS (\n", " {expected_query}\n", " ),\n", " \n", " chi_squared_query AS (\n", " {chi_squared_query}\n", " )\n", "\n", " SELECT\n", " observed_query.in_gene_in_pathway,\n", " observed_query.in_gene_not_pathway,\n", " observed_query.not_gene_in_pathway,\n", " observed_query.not_gene_not_pathway,\n", " chi_squared_query.chi_squared_stat,\n", " chi_squared_query.stable_id,\n", " chi_squared_query.name\n", "\n", " FROM chi_squared_query\n", "\n", " INNER JOIN observed_query\n", " ON chi_squared_query.stable_id = observed_query.stable_id\n", " {lowest_level_filter}\n", " ORDER BY chi_squared_stat DESC\n", "\"\"\".format(\n", " # make final query a little easier to read by removing/adding some white space \n", " gene_list_query=\"\\n \".join(gene_list_query.strip().splitlines()),\n", " gene_pp_query=\"\\n \".join(gene_pp_query.strip().splitlines()),\n", " not_gene_list_query=\"\\n \".join(not_gene_list_query.strip().splitlines()),\n", " not_gene_pp_query=\"\\n \".join(not_gene_pp_query.strip().splitlines()),\n", " gene_count_query=\"\\n \".join(gene_count_query.strip().splitlines()),\n", " observed_query=\"\\n \".join(observed_query.strip().splitlines()),\n", " sum_query=\"\\n \".join(sum_query.strip().splitlines()),\n", " expected_query=\"\\n \".join(expected_query.strip().splitlines()),\n", " chi_squared_query=\"\\n \".join(chi_squared_query.strip().splitlines()),\n", " lowest_level_filter=(\n", " \"\\n \"+\"\\n\".join(lowest_level_filter.strip().splitlines())+\"\\n\" if lowest_level else \"\"\n", " )\n", ").strip(\"\\n\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "YEjfpM9Q4xsz" }, "source": [ "## Display Final Query" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MfitLyzmgq_P", "outputId": "b554ac13-3c67-4639-9993-e1df53834ee5" }, "source": [ "# print the formatted final query\n", "print(final_query)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " WITH\n", " gene_list_query AS (\n", " -- Table that contains a list of all distinct genes that map to Reactome\n", " -- physical entities \n", " SELECT\n", " DISTINCT pe.uniprot_id\n", " \n", " FROM\n", " `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", " \n", " WHERE\n", " -- filter by pathways that are related to genes in list\n", " pe.name IN ('RHOT1','MYO7A','ZBTB10','MATK','ST18','RPS23','GCNT1','DROSHA','NUAK1','CCPG1','PDGFD','KLRAP1','MTAP','RNF13','THBS1','MLX','FAP','TIMP3','PRSS1','SLC7A11','OLFML3','RPS20','MCM5','POLE','STEAP4','LRRC8D','WBP1L','ENTPD5','SYNE1','DPT','COPZ2','TRIO','PDPR')\n", " ),\n", "\n", " gene_pp_query AS (\n", " -- Table that maps pathways to the total number of interesting genes within\n", " -- that pathway\n", " SELECT\n", " COUNT(DISTINCT gene_list_query.uniprot_id) as num_genes,\n", " pathway.stable_id,\n", " pathway.name\n", " \n", " FROM\n", " gene_list_query\n", " \n", " INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", " -- filter for interactions with genes that match a reactome\n", " -- physical entity\n", " ON gene_list_query.uniprot_id = pe.uniprot_id\n", " \n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n", " -- link physical entities to pathways via intermediate table\n", " ON pe.stable_id = pe2pathway.pe_stable_id\n", " \n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n", " -- link physical entities to pathways\n", " ON pe2pathway.pathway_stable_id = pathway.stable_id\n", " \n", " WHERE\n", " -- filter by stronger evidence: \"Traceable Author Statement\" \n", " pe2pathway.evidence_code = 'TAS'\n", " \n", " GROUP BY pathway.stable_id, pathway.name\n", " ORDER BY num_genes DESC\n", " ),\n", "\n", " not_gene_list_query AS (\n", " -- Table that contains a list of all genes that are NOT in the interest list\n", " -- This query depends on the previous \"gene_list_query\" sub-query.\n", " SELECT\n", " DISTINCT pe.uniprot_id\n", " \n", " FROM `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", " \n", " WHERE\n", " pe.uniprot_id NOT IN (\n", " SELECT uniprot_id FROM gene_list_query\n", " )\n", " ),\n", "\n", " not_gene_pp_query AS (\n", " -- Table that maps pathways to the number of proteins that are NOT drug\n", " -- targets in that pathway.\n", " SELECT\n", " COUNT(DISTINCT not_gene_list_query.uniprot_id) AS num_not_genes,\n", " pathway.stable_id,\n", " pathway.name\n", " \n", " FROM not_gene_list_query\n", " \n", " INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n", " ON not_gene_list_query.uniprot_id = pe.uniprot_id\n", " \n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n", " ON pe.stable_id = pe2pathway.pe_stable_id\n", " \n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n", " ON pe2pathway.pathway_stable_id = pathway.stable_id\n", " \n", " WHERE\n", " -- filter by stronger evidence: \"Traceable Author Statement\" \n", " pe2pathway.evidence_code = 'TAS'\n", " \n", " GROUP BY pathway.stable_id, pathway.name\n", " ORDER BY num_not_genes DESC\n", " ),\n", "\n", " gene_count_query AS (\n", " -- Table that contains the counts of # of genes that are/are not targets\n", " SELECT\n", " gene_count,\n", " not_gene_count,\n", " gene_count + not_gene_count AS total_count\n", " \n", " FROM \n", " (SELECT COUNT(*) AS gene_count FROM gene_list_query),\n", " (SELECT COUNT(*) AS not_gene_count FROM not_gene_list_query)\n", " ),\n", "\n", " observed_query AS (\n", " -- Table with observed values per pathway in the contingency matrix\n", " SELECT\n", " gene_pp_query.num_genes AS in_gene_in_pathway,\n", " not_gene_pp_query.num_not_genes AS not_gene_in_pathway,\n", " gene_count_query.gene_count - gene_pp_query.num_genes AS in_gene_not_pathway,\n", " gene_count_query.not_gene_count - not_gene_pp_query.num_not_genes AS not_gene_not_pathway,\n", " gene_pp_query.stable_id,\n", " gene_pp_query.name\n", " \n", " FROM \n", " gene_pp_query,\n", " gene_count_query\n", " \n", " INNER JOIN not_gene_pp_query\n", " ON gene_pp_query.stable_id = not_gene_pp_query.stable_id\n", " ),\n", "\n", " sum_query AS (\n", " -- Table with summed observed values per pathway in the contingency matrix\n", " SELECT\n", " observed_query.in_gene_in_pathway + observed_query.not_gene_in_pathway AS pathway_total,\n", " observed_query.in_gene_not_pathway + observed_query.not_gene_not_pathway AS not_pathway_total,\n", " observed_query.in_gene_in_pathway + observed_query.in_gene_not_pathway AS gene_total,\n", " observed_query.not_gene_in_pathway + observed_query.not_gene_not_pathway AS not_gene_total,\n", " observed_query.stable_id,\n", " observed_query.name\n", " \n", " FROM\n", " observed_query\n", " ),\n", "\n", " expected_query AS (\n", " -- Table with the expected values per pathway in the contingency matrix\n", " SELECT \n", " sum_query.gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_in_gene_in_pathway,\n", " sum_query.not_gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_not_gene_in_pathway,\n", " sum_query.gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_in_gene_not_pathway,\n", " sum_query.not_gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_not_gene_not_pathway,\n", " sum_query.stable_id,\n", " sum_query.name\n", " \n", " FROM \n", " sum_query, gene_count_query\n", " ),\n", " \n", " chi_squared_query AS (\n", " -- Table with the chi-squared statistic for each pathway\n", " SELECT\n", " -- Chi squared statistic with Yates' correction\n", " POW(ABS(observed_query.in_gene_in_pathway - expected_query.exp_in_gene_in_pathway) - 0.5, 2) / expected_query.exp_in_gene_in_pathway \n", " + POW(ABS(observed_query.not_gene_in_pathway - expected_query.exp_not_gene_in_pathway) - 0.5, 2) / expected_query.exp_not_gene_in_pathway\n", " + POW(ABS(observed_query.in_gene_not_pathway - expected_query.exp_in_gene_not_pathway) - 0.5, 2) / expected_query.exp_in_gene_not_pathway\n", " + POW(ABS(observed_query.not_gene_not_pathway - expected_query.exp_not_gene_not_pathway) - 0.5, 2) / expected_query.exp_not_gene_not_pathway\n", " AS chi_squared_stat,\n", " observed_query.stable_id,\n", " observed_query.name\n", " \n", " FROM observed_query\n", " \n", " INNER JOIN expected_query\n", " ON observed_query.stable_id = expected_query.stable_id\n", " )\n", "\n", " SELECT\n", " observed_query.in_gene_in_pathway,\n", " observed_query.in_gene_not_pathway,\n", " observed_query.not_gene_in_pathway,\n", " observed_query.not_gene_not_pathway,\n", " chi_squared_query.chi_squared_stat,\n", " chi_squared_query.stable_id,\n", " chi_squared_query.name\n", "\n", " FROM chi_squared_query\n", "\n", " INNER JOIN observed_query\n", " ON chi_squared_query.stable_id = observed_query.stable_id\n", " \n", " INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n", " ON chi_squared_query.stable_id = pathway.stable_id\n", "\n", " WHERE pathway.lowest_level = TRUE\n", "\n", " ORDER BY chi_squared_stat DESC\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "Fjkbq4FR19in" }, "source": [ "## Execute the Query\n", "\n", "Now execute the query to calculate a chi-squared statistic for each pathway:" ] }, { "cell_type": "code", "metadata": { "id": "Ua5oYRFY1aVu" }, "source": [ "# run query and put results in data frame\n", "chi_squared_pathways = client.query(final_query).result().to_dataframe()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "9Yt-GatY2awc", "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "outputId": "3c62ac76-7a84-4ae3-89a2-9ea7e35f2737" }, "source": [ "# display the data frame\n", "chi_squared_pathways" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
in_gene_in_pathwayin_gene_not_pathwaynot_gene_in_pathwaynot_gene_not_pathwaychi_squared_statstable_idname
0219311111433.477023R-HSA-68962Activation of the pre-replicative complex
112031114232.311989R-HSA-1237112Methionine salvage pathway
212031114232.311989R-HSA-9013425RHOT1 GTPase cycle
3219501109520.236790R-HSA-72695Formation of the ternary complex, and subseque...
412061113918.048197R-HSA-163765ChREBP activates metabolic gene expression
5219571108817.513505R-HSA-72649Translation initiation complex formation
6219571108817.513505R-HSA-72702Ribosomal scanning and start codon recognition
712071113815.671798R-HSA-8850843Phosphate bond hydrolysis by NTPDase proteins
812071113815.671798R-HSA-68952DNA replication initiation
9120101113511.137000R-HSA-418885DCC mediated attractive signaling
10219881105710.567006R-HSA-192823Viral mRNA Translation
11219921105310.006894R-HSA-1799339SRP-dependent cotranslational protein targetin...
1221994110519.744550R-HSA-975956Nonsense Mediated Decay (NMD) independent of t...
13219100110459.020030R-HSA-72689Formation of a pool of free 40S subunits
14219110110357.987388R-HSA-156827L13a-mediated translational silencing of Cerul...
15219111110347.894339R-HSA-72706GTP hydrolysis and joining of the 60S ribosoma...
16219115110307.538335R-HSA-975957Nonsense Mediated Decay (NMD) enhanced by the ...
1712015111307.362504R-HSA-204174Regulation of pyruvate dehydrogenase (PDH) com...
18219119110267.206312R-HSA-114608Platelet degranulation
1912020111255.389681R-HSA-5651801PCNA-Dependent Long Patch Base Excision Repair
2012023111224.602357R-HSA-203927MicroRNA (miRNA) biogenesis
2112024111214.382195R-HSA-9660826Purinergic signaling in leishmaniasis infection
2212024111214.382195R-HSA-210991Basigin interactions
2312024111214.382195R-HSA-5223345Miscellaneous transport and binding events
2412024111214.382195R-HSA-5696397Gap-filling DNA repair synthesis and ligation ...
25219181109643.953753R-HSA-6791226Major pathway of rRNA processing in the nucleo...
2612029111163.503357R-HSA-110314Recognition of DNA damage by PCNA-containing r...
2712031111143.229512R-HSA-5656169Termination of translesion DNA synthesis
2812033111122.988310R-HSA-352230Amino acid transport across the plasma membrane
2912036111092.676134R-HSA-176187Activation of ATR in response to replication s...
3012036111092.676134R-HSA-8950505Gene and protein expression by JAK-STAT signal...
3112037111082.583220R-HSA-5083635Defective B3GALTL causes Peters-plus syndrome ...
3212038111072.495164R-HSA-5173214O-glycosylation of TSR domain-containing proteins
3312040111052.332203R-HSA-5696400Dual Incision in GG-NER
3412054110911.530727R-HSA-9013409RHOJ GTPase cycle
3512054110911.530727R-HSA-193648NRAGE signals death through JNK
3612063110821.206546R-HSA-6782210Gap-filling DNA repair synthesis and ligation ...
3712064110811.176348R-HSA-6782135Dual incision in TC-NER
3812065110801.147121R-HSA-8936459RUNX1 regulates genes involved in megakaryocyt...
3912070110751.014107R-HSA-68949Orc1 removal from chromatin
4012073110720.943520R-HSA-9013408RHOG GTPase cycle
4112079110660.819465R-HSA-416482G alpha (12/13) signalling events
4212080110650.800735R-HSA-216083Integrin cell surface interactions
4312087110580.682709R-HSA-9013404RAC2 GTPase cycle
4412091110540.624208R-HSA-6804756Regulation of TP53 Activity through Phosphoryl...
4512093110520.597060R-HSA-9013423RAC3 GTPase cycle
4612099110460.523015R-HSA-6811434COPI-dependent Golgi-to-ER retrograde traffic
47120101110440.500559R-HSA-6807878COPI-mediated anterograde transport
48120149109960.170910R-HSA-8980692RHOA GTPase cycle
49120158109870.137271R-HSA-9013148CDC42 GTPase cycle
50120184109610.067711R-HSA-9013149RAC1 GTPase cycle
51120204109410.034678R-HSA-5689880Ub-specific processing proteases
\n", "
" ], "text/plain": [ " in_gene_in_pathway ... name\n", "0 2 ... Activation of the pre-replicative complex\n", "1 1 ... Methionine salvage pathway\n", "2 1 ... RHOT1 GTPase cycle\n", "3 2 ... Formation of the ternary complex, and subseque...\n", "4 1 ... ChREBP activates metabolic gene expression\n", "5 2 ... Translation initiation complex formation\n", "6 2 ... Ribosomal scanning and start codon recognition\n", "7 1 ... Phosphate bond hydrolysis by NTPDase proteins\n", "8 1 ... DNA replication initiation\n", "9 1 ... DCC mediated attractive signaling\n", "10 2 ... Viral mRNA Translation\n", "11 2 ... SRP-dependent cotranslational protein targetin...\n", "12 2 ... Nonsense Mediated Decay (NMD) independent of t...\n", "13 2 ... Formation of a pool of free 40S subunits\n", "14 2 ... L13a-mediated translational silencing of Cerul...\n", "15 2 ... GTP hydrolysis and joining of the 60S ribosoma...\n", "16 2 ... Nonsense Mediated Decay (NMD) enhanced by the ...\n", "17 1 ... Regulation of pyruvate dehydrogenase (PDH) com...\n", "18 2 ... Platelet degranulation \n", "19 1 ... PCNA-Dependent Long Patch Base Excision Repair\n", "20 1 ... MicroRNA (miRNA) biogenesis\n", "21 1 ... Purinergic signaling in leishmaniasis infection\n", "22 1 ... Basigin interactions\n", "23 1 ... Miscellaneous transport and binding events\n", "24 1 ... Gap-filling DNA repair synthesis and ligation ...\n", "25 2 ... Major pathway of rRNA processing in the nucleo...\n", "26 1 ... Recognition of DNA damage by PCNA-containing r...\n", "27 1 ... Termination of translesion DNA synthesis\n", "28 1 ... Amino acid transport across the plasma membrane\n", "29 1 ... Activation of ATR in response to replication s...\n", "30 1 ... Gene and protein expression by JAK-STAT signal...\n", "31 1 ... Defective B3GALTL causes Peters-plus syndrome ...\n", "32 1 ... O-glycosylation of TSR domain-containing proteins\n", "33 1 ... Dual Incision in GG-NER\n", "34 1 ... RHOJ GTPase cycle\n", "35 1 ... NRAGE signals death through JNK\n", "36 1 ... Gap-filling DNA repair synthesis and ligation ...\n", "37 1 ... Dual incision in TC-NER\n", "38 1 ... RUNX1 regulates genes involved in megakaryocyt...\n", "39 1 ... Orc1 removal from chromatin\n", "40 1 ... RHOG GTPase cycle\n", "41 1 ... G alpha (12/13) signalling events\n", "42 1 ... Integrin cell surface interactions\n", "43 1 ... RAC2 GTPase cycle\n", "44 1 ... Regulation of TP53 Activity through Phosphoryl...\n", "45 1 ... RAC3 GTPase cycle\n", "46 1 ... COPI-dependent Golgi-to-ER retrograde traffic\n", "47 1 ... COPI-mediated anterograde transport\n", "48 1 ... RHOA GTPase cycle\n", "49 1 ... CDC42 GTPase cycle\n", "50 1 ... RAC1 GTPase cycle\n", "51 1 ... Ub-specific processing proteases\n", "\n", "[52 rows x 7 columns]" ] }, "metadata": {}, "execution_count": 29 } ] }, { "cell_type": "markdown", "metadata": { "id": "qM42ztRJDgbM" }, "source": [ "## Calculate P-Values\n", "\n", "BigQuery does not have statistical functions to calculate p-values, so we use the SciPy stats library and update the data frame with a new p-value column:" ] }, { "cell_type": "code", "metadata": { "id": "Beim_ePN2cs7", "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "outputId": "6f95818a-7c14-451a-8549-eed4e9598983" }, "source": [ "chi_squared_pathways['p_value'] = 1-stats.chi2.cdf(chi_squared_pathways['chi_squared_stat'], 1)\n", "chi_squared_pathways" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
in_gene_in_pathwayin_gene_not_pathwaynot_gene_in_pathwaynot_gene_not_pathwaychi_squared_statstable_idnamep_value
0219311111433.477023R-HSA-68962Activation of the pre-replicative complex7.211088e-09
112031114232.311989R-HSA-1237112Methionine salvage pathway1.313007e-08
212031114232.311989R-HSA-9013425RHOT1 GTPase cycle1.313007e-08
3219501109520.236790R-HSA-72695Formation of the ternary complex, and subseque...6.842431e-06
412061113918.048197R-HSA-163765ChREBP activates metabolic gene expression2.153825e-05
5219571108817.513505R-HSA-72649Translation initiation complex formation2.852741e-05
6219571108817.513505R-HSA-72702Ribosomal scanning and start codon recognition2.852741e-05
712071113815.671798R-HSA-8850843Phosphate bond hydrolysis by NTPDase proteins7.533920e-05
812071113815.671798R-HSA-68952DNA replication initiation7.533920e-05
9120101113511.137000R-HSA-418885DCC mediated attractive signaling8.462266e-04
10219881105710.567006R-HSA-192823Viral mRNA Translation1.151240e-03
11219921105310.006894R-HSA-1799339SRP-dependent cotranslational protein targetin...1.559553e-03
1221994110519.744550R-HSA-975956Nonsense Mediated Decay (NMD) independent of t...1.798552e-03
13219100110459.020030R-HSA-72689Formation of a pool of free 40S subunits2.670370e-03
14219110110357.987388R-HSA-156827L13a-mediated translational silencing of Cerul...4.710432e-03
15219111110347.894339R-HSA-72706GTP hydrolysis and joining of the 60S ribosoma...4.958976e-03
16219115110307.538335R-HSA-975957Nonsense Mediated Decay (NMD) enhanced by the ...6.039983e-03
1712015111307.362504R-HSA-204174Regulation of pyruvate dehydrogenase (PDH) com...6.659799e-03
18219119110267.206312R-HSA-114608Platelet degranulation7.264762e-03
1912020111255.389681R-HSA-5651801PCNA-Dependent Long Patch Base Excision Repair2.025618e-02
2012023111224.602357R-HSA-203927MicroRNA (miRNA) biogenesis3.192804e-02
2112024111214.382195R-HSA-9660826Purinergic signaling in leishmaniasis infection3.631620e-02
2212024111214.382195R-HSA-210991Basigin interactions3.631620e-02
2312024111214.382195R-HSA-5223345Miscellaneous transport and binding events3.631620e-02
2412024111214.382195R-HSA-5696397Gap-filling DNA repair synthesis and ligation ...3.631620e-02
25219181109643.953753R-HSA-6791226Major pathway of rRNA processing in the nucleo...4.676696e-02
2612029111163.503357R-HSA-110314Recognition of DNA damage by PCNA-containing r...6.124455e-02
2712031111143.229512R-HSA-5656169Termination of translesion DNA synthesis7.232223e-02
2812033111122.988310R-HSA-352230Amino acid transport across the plasma membrane8.386764e-02
2912036111092.676134R-HSA-176187Activation of ATR in response to replication s...1.018627e-01
3012036111092.676134R-HSA-8950505Gene and protein expression by JAK-STAT signal...1.018627e-01
3112037111082.583220R-HSA-5083635Defective B3GALTL causes Peters-plus syndrome ...1.080017e-01
3212038111072.495164R-HSA-5173214O-glycosylation of TSR domain-containing proteins1.141965e-01
3312040111052.332203R-HSA-5696400Dual Incision in GG-NER1.267224e-01
3412054110911.530727R-HSA-9013409RHOJ GTPase cycle2.160034e-01
3512054110911.530727R-HSA-193648NRAGE signals death through JNK2.160034e-01
3612063110821.206546R-HSA-6782210Gap-filling DNA repair synthesis and ligation ...2.720173e-01
3712064110811.176348R-HSA-6782135Dual incision in TC-NER2.781007e-01
3812065110801.147121R-HSA-8936459RUNX1 regulates genes involved in megakaryocyt...2.841526e-01
3912070110751.014107R-HSA-68949Orc1 removal from chromatin3.139209e-01
4012073110720.943520R-HSA-9013408RHOG GTPase cycle3.313742e-01
4112079110660.819465R-HSA-416482G alpha (12/13) signalling events3.653366e-01
4212080110650.800735R-HSA-216083Integrin cell surface interactions3.708738e-01
4312087110580.682709R-HSA-9013404RAC2 GTPase cycle4.086556e-01
4412091110540.624208R-HSA-6804756Regulation of TP53 Activity through Phosphoryl...4.294877e-01
4512093110520.597060R-HSA-9013423RAC3 GTPase cycle4.397019e-01
4612099110460.523015R-HSA-6811434COPI-dependent Golgi-to-ER retrograde traffic4.695582e-01
47120101110440.500559R-HSA-6807878COPI-mediated anterograde transport4.792546e-01
48120149109960.170910R-HSA-8980692RHOA GTPase cycle6.793044e-01
49120158109870.137271R-HSA-9013148CDC42 GTPase cycle7.110095e-01
50120184109610.067711R-HSA-9013149RAC1 GTPase cycle7.946990e-01
51120204109410.034678R-HSA-5689880Ub-specific processing proteases8.522717e-01
\n", "
" ], "text/plain": [ " in_gene_in_pathway ... p_value\n", "0 2 ... 7.211088e-09\n", "1 1 ... 1.313007e-08\n", "2 1 ... 1.313007e-08\n", "3 2 ... 6.842431e-06\n", "4 1 ... 2.153825e-05\n", "5 2 ... 2.852741e-05\n", "6 2 ... 2.852741e-05\n", "7 1 ... 7.533920e-05\n", "8 1 ... 7.533920e-05\n", "9 1 ... 8.462266e-04\n", "10 2 ... 1.151240e-03\n", "11 2 ... 1.559553e-03\n", "12 2 ... 1.798552e-03\n", "13 2 ... 2.670370e-03\n", "14 2 ... 4.710432e-03\n", "15 2 ... 4.958976e-03\n", "16 2 ... 6.039983e-03\n", "17 1 ... 6.659799e-03\n", "18 2 ... 7.264762e-03\n", "19 1 ... 2.025618e-02\n", "20 1 ... 3.192804e-02\n", "21 1 ... 3.631620e-02\n", "22 1 ... 3.631620e-02\n", "23 1 ... 3.631620e-02\n", "24 1 ... 3.631620e-02\n", "25 2 ... 4.676696e-02\n", "26 1 ... 6.124455e-02\n", "27 1 ... 7.232223e-02\n", "28 1 ... 8.386764e-02\n", "29 1 ... 1.018627e-01\n", "30 1 ... 1.018627e-01\n", "31 1 ... 1.080017e-01\n", "32 1 ... 1.141965e-01\n", "33 1 ... 1.267224e-01\n", "34 1 ... 2.160034e-01\n", "35 1 ... 2.160034e-01\n", "36 1 ... 2.720173e-01\n", "37 1 ... 2.781007e-01\n", "38 1 ... 2.841526e-01\n", "39 1 ... 3.139209e-01\n", "40 1 ... 3.313742e-01\n", "41 1 ... 3.653366e-01\n", "42 1 ... 3.708738e-01\n", "43 1 ... 4.086556e-01\n", "44 1 ... 4.294877e-01\n", "45 1 ... 4.397019e-01\n", "46 1 ... 4.695582e-01\n", "47 1 ... 4.792546e-01\n", "48 1 ... 6.793044e-01\n", "49 1 ... 7.110095e-01\n", "50 1 ... 7.946990e-01\n", "51 1 ... 8.522717e-01\n", "\n", "[52 rows x 8 columns]" ] }, "metadata": {}, "execution_count": 30 } ] }, { "cell_type": "markdown", "metadata": { "id": "TqPYE1JZDq5L" }, "source": [ "## Adjust for Multiple Testing\n", "\n", "Since we're testing multiple pathways, we need to adjust the p-value threshold for significance accordingly. The number of pathways tested depends on whether or not we're considering all pathways, or just the lowest level pathways. That count can be obtained with the following query." ] }, { "cell_type": "code", "metadata": { "id": "7KziIxfED8kB" }, "source": [ "# run query and put results in data frame\n", "num_pathways_result = client.query((\"\"\"\n", " SELECT\n", " COUNT (*) AS num_pathways\n", " FROM\n", " `isb-cgc-bq.reactome_versioned.pathway_v77` as pathway\n", " {lowest_level_filter}\n", "\"\"\").format(\n", " lowest_level_filter=(\"WHERE lowest_level = TRUE\" if lowest_level else \"\")\n", ")).result().to_dataframe()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "E_7ncwlOEmuG", "outputId": "2a300402-e3cc-4cdf-f507-536a4ef38fdd" }, "source": [ "# display data frame\n", "num_pathways_result" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
num_pathways
01773
\n", "
" ], "text/plain": [ " num_pathways\n", "0 1773" ] }, "metadata": {}, "execution_count": 32 } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7B9__TlA4M9u", "outputId": "0ef921bf-8224-4b14-ce6a-d3418ccaea57" }, "source": [ "# adjust significance threshold for multiple testing, using a p-value of 0.01\n", "num_pathways = num_pathways_result['num_pathways'][0]\n", "significance_threshold = 0.01/num_pathways\n", "print('Significance Threshold: {}'.format(significance_threshold))\n", "\n", "# find all pathways that meet the significance criterion after adjusting for\n", "# multiple testing\n", "significant_pathway_index = chi_squared_pathways['p_value']\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
in_gene_in_pathwayin_gene_not_pathwaynot_gene_in_pathwaynot_gene_not_pathwaychi_squared_statstable_idnamep_value
0219311111433.477023R-HSA-68962Activation of the pre-replicative complex7.211088e-09
112031114232.311989R-HSA-1237112Methionine salvage pathway1.313007e-08
212031114232.311989R-HSA-9013425RHOT1 GTPase cycle1.313007e-08
\n", "" ], "text/plain": [ " in_gene_in_pathway ... p_value\n", "0 2 ... 7.211088e-09\n", "1 1 ... 1.313007e-08\n", "2 1 ... 1.313007e-08\n", "\n", "[3 rows x 8 columns]" ] }, "metadata": {}, "execution_count": 34 } ] }, { "cell_type": "markdown", "metadata": { "id": "if9v8H2T12cp" }, "source": [ "The results of this analysis suggest that at least three pathways may be related to the genes identified. " ] }, { "cell_type": "markdown", "metadata": { "id": "s31t8Ed-7cfK" }, "source": [ "## Verify Results by Comparing to SciPy Chi-Squared Function\n", "\n", "We verify these BigQuery results by calculating the same chi-squared statistic using the SciPy package. Comparing the SciPy p-values and statistics to those of the BigQuery-derived results confirms that they are identical." ] }, { "cell_type": "code", "metadata": { "id": "-BPVmeMF1CGO" }, "source": [ "# extract observed values from bigquery result\n", "observed = chi_squared_pathways[[\n", " 'in_gene_in_pathway',\n", " 'in_gene_not_pathway',\n", " 'not_gene_in_pathway',\n", " 'not_gene_not_pathway'\n", "]]\n", "\n", "# calculate the chi-squared statistic using the scipy stats package \n", "chi2_stat = []\n", "chi2_pvalue = []\n", "for index, row in observed.iterrows():\n", " stat, pvalue, _, _ = stats.chi2_contingency(\n", " np.reshape(np.matrix(row), (2,2)), correction=True\n", " )\n", " chi2_stat.append(stat)\n", " chi2_pvalue.append(pvalue)\n", "\n", "# add columns to the original data frame\n", "chi_squared_pathways['scipy_stat'] = chi2_stat\n", "chi_squared_pathways['scipy_p_value'] = chi2_pvalue" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "WWgNFm6w36s9", "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "outputId": "612d8812-294b-43a4-98e8-dad78af21f40" }, "source": [ "# display the updated data frame\n", "chi_squared_pathways" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
in_gene_in_pathwayin_gene_not_pathwaynot_gene_in_pathwaynot_gene_not_pathwaychi_squared_statstable_idnamep_valuescipy_statscipy_p_value
0219311111433.477023R-HSA-68962Activation of the pre-replicative complex7.211088e-0933.4770237.211088e-09
112031114232.311989R-HSA-1237112Methionine salvage pathway1.313007e-0832.3119891.313007e-08
212031114232.311989R-HSA-9013425RHOT1 GTPase cycle1.313007e-0832.3119891.313007e-08
3219501109520.236790R-HSA-72695Formation of the ternary complex, and subseque...6.842431e-0620.2367906.842431e-06
412061113918.048197R-HSA-163765ChREBP activates metabolic gene expression2.153825e-0518.0481972.153825e-05
5219571108817.513505R-HSA-72649Translation initiation complex formation2.852741e-0517.5135052.852741e-05
6219571108817.513505R-HSA-72702Ribosomal scanning and start codon recognition2.852741e-0517.5135052.852741e-05
712071113815.671798R-HSA-8850843Phosphate bond hydrolysis by NTPDase proteins7.533920e-0515.6717987.533920e-05
812071113815.671798R-HSA-68952DNA replication initiation7.533920e-0515.6717987.533920e-05
9120101113511.137000R-HSA-418885DCC mediated attractive signaling8.462266e-0411.1370008.462266e-04
10219881105710.567006R-HSA-192823Viral mRNA Translation1.151240e-0310.5670061.151240e-03
11219921105310.006894R-HSA-1799339SRP-dependent cotranslational protein targetin...1.559553e-0310.0068941.559553e-03
1221994110519.744550R-HSA-975956Nonsense Mediated Decay (NMD) independent of t...1.798552e-039.7445501.798552e-03
13219100110459.020030R-HSA-72689Formation of a pool of free 40S subunits2.670370e-039.0200302.670370e-03
14219110110357.987388R-HSA-156827L13a-mediated translational silencing of Cerul...4.710432e-037.9873884.710432e-03
15219111110347.894339R-HSA-72706GTP hydrolysis and joining of the 60S ribosoma...4.958976e-037.8943394.958976e-03
16219115110307.538335R-HSA-975957Nonsense Mediated Decay (NMD) enhanced by the ...6.039983e-037.5383356.039983e-03
1712015111307.362504R-HSA-204174Regulation of pyruvate dehydrogenase (PDH) com...6.659799e-037.3625046.659799e-03
18219119110267.206312R-HSA-114608Platelet degranulation7.264762e-037.2063127.264762e-03
1912020111255.389681R-HSA-5651801PCNA-Dependent Long Patch Base Excision Repair2.025618e-025.3896812.025618e-02
2012023111224.602357R-HSA-203927MicroRNA (miRNA) biogenesis3.192804e-024.6023573.192804e-02
2112024111214.382195R-HSA-9660826Purinergic signaling in leishmaniasis infection3.631620e-024.3821953.631620e-02
2212024111214.382195R-HSA-210991Basigin interactions3.631620e-024.3821953.631620e-02
2312024111214.382195R-HSA-5223345Miscellaneous transport and binding events3.631620e-024.3821953.631620e-02
2412024111214.382195R-HSA-5696397Gap-filling DNA repair synthesis and ligation ...3.631620e-024.3821953.631620e-02
25219181109643.953753R-HSA-6791226Major pathway of rRNA processing in the nucleo...4.676696e-023.9537534.676696e-02
2612029111163.503357R-HSA-110314Recognition of DNA damage by PCNA-containing r...6.124455e-023.5033576.124455e-02
2712031111143.229512R-HSA-5656169Termination of translesion DNA synthesis7.232223e-023.2295127.232223e-02
2812033111122.988310R-HSA-352230Amino acid transport across the plasma membrane8.386764e-022.9883108.386764e-02
2912036111092.676134R-HSA-176187Activation of ATR in response to replication s...1.018627e-012.6761341.018627e-01
3012036111092.676134R-HSA-8950505Gene and protein expression by JAK-STAT signal...1.018627e-012.6761341.018627e-01
3112037111082.583220R-HSA-5083635Defective B3GALTL causes Peters-plus syndrome ...1.080017e-012.5832201.080017e-01
3212038111072.495164R-HSA-5173214O-glycosylation of TSR domain-containing proteins1.141965e-012.4951641.141965e-01
3312040111052.332203R-HSA-5696400Dual Incision in GG-NER1.267224e-012.3322031.267224e-01
3412054110911.530727R-HSA-9013409RHOJ GTPase cycle2.160034e-011.5307272.160034e-01
3512054110911.530727R-HSA-193648NRAGE signals death through JNK2.160034e-011.5307272.160034e-01
3612063110821.206546R-HSA-6782210Gap-filling DNA repair synthesis and ligation ...2.720173e-011.2065462.720173e-01
3712064110811.176348R-HSA-6782135Dual incision in TC-NER2.781007e-011.1763482.781007e-01
3812065110801.147121R-HSA-8936459RUNX1 regulates genes involved in megakaryocyt...2.841526e-011.1471212.841526e-01
3912070110751.014107R-HSA-68949Orc1 removal from chromatin3.139209e-011.0141073.139209e-01
4012073110720.943520R-HSA-9013408RHOG GTPase cycle3.313742e-010.9435203.313742e-01
4112079110660.819465R-HSA-416482G alpha (12/13) signalling events3.653366e-010.8194653.653366e-01
4212080110650.800735R-HSA-216083Integrin cell surface interactions3.708738e-010.8007353.708738e-01
4312087110580.682709R-HSA-9013404RAC2 GTPase cycle4.086556e-010.6827094.086556e-01
4412091110540.624208R-HSA-6804756Regulation of TP53 Activity through Phosphoryl...4.294877e-010.6242084.294877e-01
4512093110520.597060R-HSA-9013423RAC3 GTPase cycle4.397019e-010.5970604.397019e-01
4612099110460.523015R-HSA-6811434COPI-dependent Golgi-to-ER retrograde traffic4.695582e-010.5230154.695582e-01
47120101110440.500559R-HSA-6807878COPI-mediated anterograde transport4.792546e-010.5005594.792546e-01
48120149109960.170910R-HSA-8980692RHOA GTPase cycle6.793044e-010.1709106.793044e-01
49120158109870.137271R-HSA-9013148CDC42 GTPase cycle7.110095e-010.1372717.110095e-01
50120184109610.067711R-HSA-9013149RAC1 GTPase cycle7.946990e-010.0677117.946990e-01
51120204109410.034678R-HSA-5689880Ub-specific processing proteases8.522717e-010.0346788.522717e-01
\n", "
" ], "text/plain": [ " in_gene_in_pathway in_gene_not_pathway ... scipy_stat scipy_p_value\n", "0 2 19 ... 33.477023 7.211088e-09\n", "1 1 20 ... 32.311989 1.313007e-08\n", "2 1 20 ... 32.311989 1.313007e-08\n", "3 2 19 ... 20.236790 6.842431e-06\n", "4 1 20 ... 18.048197 2.153825e-05\n", "5 2 19 ... 17.513505 2.852741e-05\n", "6 2 19 ... 17.513505 2.852741e-05\n", "7 1 20 ... 15.671798 7.533920e-05\n", "8 1 20 ... 15.671798 7.533920e-05\n", "9 1 20 ... 11.137000 8.462266e-04\n", "10 2 19 ... 10.567006 1.151240e-03\n", "11 2 19 ... 10.006894 1.559553e-03\n", "12 2 19 ... 9.744550 1.798552e-03\n", "13 2 19 ... 9.020030 2.670370e-03\n", "14 2 19 ... 7.987388 4.710432e-03\n", "15 2 19 ... 7.894339 4.958976e-03\n", "16 2 19 ... 7.538335 6.039983e-03\n", "17 1 20 ... 7.362504 6.659799e-03\n", "18 2 19 ... 7.206312 7.264762e-03\n", "19 1 20 ... 5.389681 2.025618e-02\n", "20 1 20 ... 4.602357 3.192804e-02\n", "21 1 20 ... 4.382195 3.631620e-02\n", "22 1 20 ... 4.382195 3.631620e-02\n", "23 1 20 ... 4.382195 3.631620e-02\n", "24 1 20 ... 4.382195 3.631620e-02\n", "25 2 19 ... 3.953753 4.676696e-02\n", "26 1 20 ... 3.503357 6.124455e-02\n", "27 1 20 ... 3.229512 7.232223e-02\n", "28 1 20 ... 2.988310 8.386764e-02\n", "29 1 20 ... 2.676134 1.018627e-01\n", "30 1 20 ... 2.676134 1.018627e-01\n", "31 1 20 ... 2.583220 1.080017e-01\n", "32 1 20 ... 2.495164 1.141965e-01\n", "33 1 20 ... 2.332203 1.267224e-01\n", "34 1 20 ... 1.530727 2.160034e-01\n", "35 1 20 ... 1.530727 2.160034e-01\n", "36 1 20 ... 1.206546 2.720173e-01\n", "37 1 20 ... 1.176348 2.781007e-01\n", "38 1 20 ... 1.147121 2.841526e-01\n", "39 1 20 ... 1.014107 3.139209e-01\n", "40 1 20 ... 0.943520 3.313742e-01\n", "41 1 20 ... 0.819465 3.653366e-01\n", "42 1 20 ... 0.800735 3.708738e-01\n", "43 1 20 ... 0.682709 4.086556e-01\n", "44 1 20 ... 0.624208 4.294877e-01\n", "45 1 20 ... 0.597060 4.397019e-01\n", "46 1 20 ... 0.523015 4.695582e-01\n", "47 1 20 ... 0.500559 4.792546e-01\n", "48 1 20 ... 0.170910 6.793044e-01\n", "49 1 20 ... 0.137271 7.110095e-01\n", "50 1 20 ... 0.067711 7.946990e-01\n", "51 1 20 ... 0.034678 8.522717e-01\n", "\n", "[52 rows x 10 columns]" ] }, "metadata": {}, "execution_count": 36 } ] }, { "cell_type": "markdown", "metadata": { "id": "HBGSd1wS5GRK" }, "source": [ "# Conclusion\n", "\n", "This notebook demonstrated usage of the Reactome BigQuery dataset for basic cancer pathway identification from a gene set, as well as a more complex pathway enrichment analysis using a chi-squared statistic." ] } ] }