{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "How_to_use_the_Reactome_BQ_dataset.ipynb",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qgwv4lK5V8Zc"
},
"source": [
"# How to use the Reactome BigQuery dataset\n",
"Check out other notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n",
"\n",
"- **Title:** How to use the Reactome BigQuery dataset\n",
"- **Author:** John Phan\n",
"- **Created:** 2021-09-13\n",
"- **Purpose:** Demonstrate basic usage of the Reactome BigQuery dataset\n",
"- **URL:** https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_use_the_Reactome_BQ_dataset.ipynb\n",
"\n",
"This notebook demonstrates basic usage of the Reactome BigQuery dataset. Analysis of this dataset can provide a powerful tool for identifying pathways related to cancer biomarkers. \n",
"\n",
"The Reactome is an open-source, manually curated, and peer-reviewed pathway database. More information can be found here: https://reactome.org/. \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1zVfEL8i8__O"
},
"source": [
"# Initialize Notebook Environment\n",
"\n",
"Before running the analysis, we need to load dependencies, authenticate to BigQuery, and customize notebook parameters."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ujUW1wF23d93"
},
"source": [
"## Import Dependencies"
]
},
{
"cell_type": "code",
"metadata": {
"id": "LNlADvO-gken"
},
"source": [
"# GCP Libraries\n",
"from google.cloud import bigquery\n",
"from google.colab import auth\n",
"\n",
"# Data Analytics\n",
"import numpy as np\n",
"from scipy import stats"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "LBHnu8bZDEcE"
},
"source": [
"## Authenticate\n",
"\n",
"Before using BigQuery, we need to get authorization for access to BigQuery and the Google Cloud. For more information see ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods can be found [here](https://googleapis.dev/python/google-api-core/latest/auth.html)."
]
},
{
"cell_type": "code",
"metadata": {
"id": "M6BEFHj2dd5a"
},
"source": [
"# if you're using Google Colab, authenticate to gcloud with the following\n",
"auth.authenticate_user()\n",
"\n",
"# alternatively, use the gcloud SDK\n",
"#!gcloud auth application-default login"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "RWuECMsRptWJ"
},
"source": [
"## Parameters\n",
"\n",
"Customize the following parameters based on your notebook, execution environment, or project."
]
},
{
"cell_type": "code",
"metadata": {
"id": "veP6gPatu2iW"
},
"source": [
"# set the google project that will be billed for this notebook's computations\n",
"google_project = 'google-project' ## change me"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "F6zTODbDq3g9"
},
"source": [
"## BigQuery Client\n",
"\n",
"Create the BigQuery client."
]
},
{
"cell_type": "code",
"metadata": {
"id": "HNkfSXedwzbC"
},
"source": [
"# Create a client to access the data within BigQuery\n",
"client = bigquery.Client(google_project)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "7kVZ88dU85Vs"
},
"source": [
"# Identify all Reactome Pathways Related to Genes\n",
"\n",
"We can join tables from the Reactome BigQuery dataset to identify all pathways related to our genes of interest. We first have to map the gene names to Uniprot IDs in the Reactome \"physical entities\" table. These can then be mapped to Reactome pathways. We further filter the physical entity to pathway evidence codes to retain only interactions that have \"Traceable Author Statements\" (TAS) rather than just \"Inferred from Electronic Annotation\" (IEA) to avoid evidence that have not been manually curated. \n",
"\n",
"We use the following genes to identify related pathways. These genes were identified in an ovarian cancer chemo-response study by [Bosquet et al](https://molecular-cancer.biomedcentral.com/articles/10.1186/s12943-016-0548-9). "
]
},
{
"cell_type": "code",
"metadata": {
"id": "4QDl8QK9dZJf"
},
"source": [
"# set parameters for query\n",
"genes = \"'RHOT1','MYO7A','ZBTB10','MATK','ST18','RPS23','GCNT1','DROSHA','NUAK1','CCPG1',\\\n",
"'PDGFD','KLRAP1','MTAP','RNF13','THBS1','MLX','FAP','TIMP3','PRSS1','SLC7A11',\\\n",
"'OLFML3','RPS20','MCM5','POLE','STEAP4','LRRC8D','WBP1L','ENTPD5','SYNE1','DPT',\\\n",
"'COPZ2','TRIO','PDPR'\""
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "ZwJGDXql9j6P"
},
"source": [
"# run query and put results in data frame\n",
"pathways = client.query((\"\"\"\n",
" SELECT\n",
" DISTINCT pathway.*\n",
"\n",
" FROM\n",
" `isb-cgc-bq.reactome_versioned.pathway_v77` as pathway\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` as pe2pathway\n",
" -- link pathways to physical entities via intermediate table\n",
" ON pathway.stable_id = pe2pathway.pathway_stable_id\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
" -- link pathways to physical entities\n",
" ON pe2pathway.pe_stable_id = pe.stable_id\n",
"\n",
" WHERE\n",
" -- filter by stronger evidence: \"Traceable Author Statement\" \n",
" pe2pathway.evidence_code = 'TAS'\n",
"\n",
" -- filter by pathways that are related to genes in list\n",
" AND pe.name IN ({genes}) \n",
"\n",
" ORDER BY pathway.name ASC\n",
"\"\"\").format(\n",
" genes=genes\n",
")).result().to_dataframe()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"id": "DmkVEYCHfcWK",
"outputId": "dc724070-e5f7-40f5-fce8-b3bba4aec5a4"
},
"source": [
"# Display the data frame\n",
"pathways"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" stable_id | \n",
" url | \n",
" name | \n",
" species | \n",
" lowest_level | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" R-HSA-176187 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-17... | \n",
" Activation of ATR in response to replication s... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 1 | \n",
" R-HSA-72662 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-72662 | \n",
" Activation of the mRNA upon binding of the cap... | \n",
" Homo sapiens | \n",
" False | \n",
"
\n",
" \n",
" | 2 | \n",
" R-HSA-68962 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-68962 | \n",
" Activation of the pre-replicative complex | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 3 | \n",
" R-HSA-352230 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-35... | \n",
" Amino acid transport across the plasma membrane | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 4 | \n",
" R-HSA-446203 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-44... | \n",
" Asparagine N-linked glycosylation | \n",
" Homo sapiens | \n",
" False | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | 148 | \n",
" R-HSA-192823 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-19... | \n",
" Viral mRNA Translation | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 149 | \n",
" R-HSA-2187338 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-21... | \n",
" Visual phototransduction | \n",
" Homo sapiens | \n",
" False | \n",
"
\n",
" \n",
" | 150 | \n",
" R-HSA-193704 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-19... | \n",
" p75 NTR receptor-mediated signalling | \n",
" Homo sapiens | \n",
" False | \n",
"
\n",
" \n",
" | 151 | \n",
" R-HSA-72312 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-72312 | \n",
" rRNA processing | \n",
" Homo sapiens | \n",
" False | \n",
"
\n",
" \n",
" | 152 | \n",
" R-HSA-8868773 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-88... | \n",
" rRNA processing in the nucleus and cytosol | \n",
" Homo sapiens | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
153 rows × 5 columns
\n",
"
"
],
"text/plain": [
" stable_id ... lowest_level\n",
"0 R-HSA-176187 ... True\n",
"1 R-HSA-72662 ... False\n",
"2 R-HSA-68962 ... True\n",
"3 R-HSA-352230 ... True\n",
"4 R-HSA-446203 ... False\n",
".. ... ... ...\n",
"148 R-HSA-192823 ... True\n",
"149 R-HSA-2187338 ... False\n",
"150 R-HSA-193704 ... False\n",
"151 R-HSA-72312 ... False\n",
"152 R-HSA-8868773 ... False\n",
"\n",
"[153 rows x 5 columns]"
]
},
"metadata": {},
"execution_count": 7
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4IT_WO6PO0bH"
},
"source": [
"If we're only interested in the lowest level pathways, i.e., pathways that are not parents of other pathways in the hierarchy, we can filter by the \"lowest_level\" field in the pathways table."
]
},
{
"cell_type": "code",
"metadata": {
"id": "IQcGozM6OzpW"
},
"source": [
"# run query and put results in data frame\n",
"lowest_pathways = client.query((\"\"\"\n",
" SELECT\n",
" DISTINCT pathway.*\n",
"\n",
" FROM\n",
" `isb-cgc-bq.reactome_versioned.pathway_v77` as pathway\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` as pe2pathway\n",
" -- link pathways to physical entities via intermediate table\n",
" ON pathway.stable_id = pe2pathway.pathway_stable_id\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
" -- link pathways to physical entities\n",
" ON pe2pathway.pe_stable_id = pe.stable_id\n",
"\n",
" WHERE\n",
" -- filter by stronger evidence: \"Traceable Author Statement\" \n",
" pe2pathway.evidence_code = 'TAS'\n",
"\n",
" -- filter by pathways that are related to genes in list\n",
" AND pe.name IN ({genes})\n",
"\n",
" -- filter to include just lowest level pathways\n",
" AND pathway.lowest_level = TRUE\n",
"\n",
" ORDER BY pathway.name ASC\n",
"\"\"\").format(\n",
" genes=genes\n",
")).result().to_dataframe()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "q5bZg8w6Pn7Q",
"outputId": "8096e367-7452-4814-d783-90df5cda9632"
},
"source": [
"# display data frame\n",
"lowest_pathways"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" stable_id | \n",
" url | \n",
" name | \n",
" species | \n",
" lowest_level | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" R-HSA-176187 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-17... | \n",
" Activation of ATR in response to replication s... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 1 | \n",
" R-HSA-68962 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-68962 | \n",
" Activation of the pre-replicative complex | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 2 | \n",
" R-HSA-352230 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-35... | \n",
" Amino acid transport across the plasma membrane | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 3 | \n",
" R-HSA-210991 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-21... | \n",
" Basigin interactions | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 4 | \n",
" R-HSA-9013148 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-90... | \n",
" CDC42 GTPase cycle | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 5 | \n",
" R-HSA-6811434 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-68... | \n",
" COPI-dependent Golgi-to-ER retrograde traffic | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 6 | \n",
" R-HSA-6807878 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-68... | \n",
" COPI-mediated anterograde transport | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 7 | \n",
" R-HSA-163765 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-16... | \n",
" ChREBP activates metabolic gene expression | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 8 | \n",
" R-HSA-418885 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-41... | \n",
" DCC mediated attractive signaling | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 9 | \n",
" R-HSA-68952 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-68952 | \n",
" DNA replication initiation | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 10 | \n",
" R-HSA-5696400 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-56... | \n",
" Dual Incision in GG-NER | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 11 | \n",
" R-HSA-6782135 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-67... | \n",
" Dual incision in TC-NER | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 12 | \n",
" R-HSA-72689 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-72689 | \n",
" Formation of a pool of free 40S subunits | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 13 | \n",
" R-HSA-72695 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-72695 | \n",
" Formation of the ternary complex, and subseque... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 14 | \n",
" R-HSA-416482 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-41... | \n",
" G alpha (12/13) signalling events | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 15 | \n",
" R-HSA-72706 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-72706 | \n",
" GTP hydrolysis and joining of the 60S ribosoma... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 16 | \n",
" R-HSA-5696397 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-56... | \n",
" Gap-filling DNA repair synthesis and ligation ... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 17 | \n",
" R-HSA-6782210 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-67... | \n",
" Gap-filling DNA repair synthesis and ligation ... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 18 | \n",
" R-HSA-8950505 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-89... | \n",
" Gene and protein expression by JAK-STAT signal... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 19 | \n",
" R-HSA-216083 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-21... | \n",
" Integrin cell surface interactions | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 20 | \n",
" R-HSA-156827 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-15... | \n",
" L13a-mediated translational silencing of Cerul... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 21 | \n",
" R-HSA-6791226 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-67... | \n",
" Major pathway of rRNA processing in the nucleo... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 22 | \n",
" R-HSA-1237112 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-12... | \n",
" Methionine salvage pathway | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 23 | \n",
" R-HSA-203927 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-20... | \n",
" MicroRNA (miRNA) biogenesis | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 24 | \n",
" R-HSA-5223345 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-52... | \n",
" Miscellaneous transport and binding events | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 25 | \n",
" R-HSA-193648 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-19... | \n",
" NRAGE signals death through JNK | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 26 | \n",
" R-HSA-975957 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-97... | \n",
" Nonsense Mediated Decay (NMD) enhanced by the ... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 27 | \n",
" R-HSA-975956 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-97... | \n",
" Nonsense Mediated Decay (NMD) independent of t... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 28 | \n",
" R-HSA-5173214 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-51... | \n",
" O-glycosylation of TSR domain-containing proteins | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 29 | \n",
" R-HSA-68949 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-68949 | \n",
" Orc1 removal from chromatin | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 30 | \n",
" R-HSA-5651801 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-56... | \n",
" PCNA-Dependent Long Patch Base Excision Repair | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 31 | \n",
" R-HSA-8850843 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-88... | \n",
" Phosphate bond hydrolysis by NTPDase proteins | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 32 | \n",
" R-HSA-114608 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-11... | \n",
" Platelet degranulation | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 33 | \n",
" R-HSA-9660826 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-96... | \n",
" Purinergic signaling in leishmaniasis infection | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 34 | \n",
" R-HSA-9013149 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-90... | \n",
" RAC1 GTPase cycle | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 35 | \n",
" R-HSA-9013404 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-90... | \n",
" RAC2 GTPase cycle | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 36 | \n",
" R-HSA-9013423 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-90... | \n",
" RAC3 GTPase cycle | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 37 | \n",
" R-HSA-8980692 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-89... | \n",
" RHOA GTPase cycle | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 38 | \n",
" R-HSA-9013408 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-90... | \n",
" RHOG GTPase cycle | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 39 | \n",
" R-HSA-9013409 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-90... | \n",
" RHOJ GTPase cycle | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 40 | \n",
" R-HSA-9013425 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-90... | \n",
" RHOT1 GTPase cycle | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 41 | \n",
" R-HSA-8936459 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-89... | \n",
" RUNX1 regulates genes involved in megakaryocyt... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 42 | \n",
" R-HSA-110314 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-11... | \n",
" Recognition of DNA damage by PCNA-containing r... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 43 | \n",
" R-HSA-6804756 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-68... | \n",
" Regulation of TP53 Activity through Phosphoryl... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 44 | \n",
" R-HSA-204174 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-20... | \n",
" Regulation of pyruvate dehydrogenase (PDH) com... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 45 | \n",
" R-HSA-72702 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-72702 | \n",
" Ribosomal scanning and start codon recognition | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 46 | \n",
" R-HSA-1799339 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-17... | \n",
" SRP-dependent cotranslational protein targetin... | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 47 | \n",
" R-HSA-5656169 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-56... | \n",
" Termination of translesion DNA synthesis | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 48 | \n",
" R-HSA-72649 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-72649 | \n",
" Translation initiation complex formation | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 49 | \n",
" R-HSA-5689880 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-56... | \n",
" Ub-specific processing proteases | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
" | 50 | \n",
" R-HSA-192823 | \n",
" https://reactome.org/PathwayBrowser/#/R-HSA-19... | \n",
" Viral mRNA Translation | \n",
" Homo sapiens | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" stable_id ... lowest_level\n",
"0 R-HSA-176187 ... True\n",
"1 R-HSA-68962 ... True\n",
"2 R-HSA-352230 ... True\n",
"3 R-HSA-210991 ... True\n",
"4 R-HSA-9013148 ... True\n",
"5 R-HSA-6811434 ... True\n",
"6 R-HSA-6807878 ... True\n",
"7 R-HSA-163765 ... True\n",
"8 R-HSA-418885 ... True\n",
"9 R-HSA-68952 ... True\n",
"10 R-HSA-5696400 ... True\n",
"11 R-HSA-6782135 ... True\n",
"12 R-HSA-72689 ... True\n",
"13 R-HSA-72695 ... True\n",
"14 R-HSA-416482 ... True\n",
"15 R-HSA-72706 ... True\n",
"16 R-HSA-5696397 ... True\n",
"17 R-HSA-6782210 ... True\n",
"18 R-HSA-8950505 ... True\n",
"19 R-HSA-216083 ... True\n",
"20 R-HSA-156827 ... True\n",
"21 R-HSA-6791226 ... True\n",
"22 R-HSA-1237112 ... True\n",
"23 R-HSA-203927 ... True\n",
"24 R-HSA-5223345 ... True\n",
"25 R-HSA-193648 ... True\n",
"26 R-HSA-975957 ... True\n",
"27 R-HSA-975956 ... True\n",
"28 R-HSA-5173214 ... True\n",
"29 R-HSA-68949 ... True\n",
"30 R-HSA-5651801 ... True\n",
"31 R-HSA-8850843 ... True\n",
"32 R-HSA-114608 ... True\n",
"33 R-HSA-9660826 ... True\n",
"34 R-HSA-9013149 ... True\n",
"35 R-HSA-9013404 ... True\n",
"36 R-HSA-9013423 ... True\n",
"37 R-HSA-8980692 ... True\n",
"38 R-HSA-9013408 ... True\n",
"39 R-HSA-9013409 ... True\n",
"40 R-HSA-9013425 ... True\n",
"41 R-HSA-8936459 ... True\n",
"42 R-HSA-110314 ... True\n",
"43 R-HSA-6804756 ... True\n",
"44 R-HSA-204174 ... True\n",
"45 R-HSA-72702 ... True\n",
"46 R-HSA-1799339 ... True\n",
"47 R-HSA-5656169 ... True\n",
"48 R-HSA-72649 ... True\n",
"49 R-HSA-5689880 ... True\n",
"50 R-HSA-192823 ... True\n",
"\n",
"[51 rows x 5 columns]"
]
},
"metadata": {},
"execution_count": 9
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ybqSP2aVNE99"
},
"source": [
"# Pathway Enrichment Analysis\n",
"We can identify pathways that are \"enriched\" with the genes of interest. In other words, we can answer the question: given a set of interesting genes, which pathways contain those genes at a frequency higher than random chance? By calculating the probability that a number of target genes are contained in each pathway, we can identify pathways most likely to be related. \n",
"\n",
"To do this, we can use a **chi-squared** test to determine if there is a statistically significant difference between the expected frequency of genes in a pathway compared to the observed frequency. \n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "sKLy2sxQNIu6"
},
"source": [
"# set query parameters\n",
"lowest_level = True # only show pathways at the lowest level"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "jDdXWbOHa32V"
},
"source": [
"## Construct SQL Query\n",
"\n",
"A single query can be used to calculate the chi-squared statistic for all pathways. This query is rather lengthy, but can be broken up into a series of named sub-queries. We step through each query below.\n",
"\n",
"First, we write a query that simply gets a list of all genes in the Reactome physical entity table: "
]
},
{
"cell_type": "code",
"metadata": {
"id": "r3NeGvfMbgBp"
},
"source": [
"gene_list_query = \"\"\"\n",
" -- Table that contains a list of all distinct genes that map to Reactome\n",
" -- physical entities \n",
" SELECT\n",
" DISTINCT pe.uniprot_id\n",
"\n",
" FROM\n",
" `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
"\n",
" WHERE\n",
" -- filter by pathways that are related to genes in list\n",
" pe.name IN ({genes})\n",
"\"\"\".format(\n",
" genes=genes\n",
").strip(\"\\n\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "wTVNQFJjgQ1y",
"outputId": "b089723a-6c24-4185-9c43-8e34d02f6a55"
},
"source": [
"print(gene_list_query)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" -- Table that contains a list of all distinct genes that map to Reactome\n",
" -- physical entities \n",
" SELECT\n",
" DISTINCT pe.uniprot_id\n",
"\n",
" FROM\n",
" `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
"\n",
" WHERE\n",
" -- filter by pathways that are related to genes in list\n",
" pe.name IN ('RHOT1','MYO7A','ZBTB10','MATK','ST18','RPS23','GCNT1','DROSHA','NUAK1','CCPG1','PDGFD','KLRAP1','MTAP','RNF13','THBS1','MLX','FAP','TIMP3','PRSS1','SLC7A11','OLFML3','RPS20','MCM5','POLE','STEAP4','LRRC8D','WBP1L','ENTPD5','SYNE1','DPT','COPZ2','TRIO','PDPR')\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bY4QCtDfgwBq"
},
"source": [
"We then create a query that counts all targets associated with each Reactome pathway. This query depends on the previous `gene_list_query` sub-query. "
]
},
{
"cell_type": "code",
"metadata": {
"id": "H_3xAQdjhJ9s"
},
"source": [
"gene_pp_query = \"\"\"\n",
" -- Table that maps pathways to the total number of interesting genes within\n",
" -- that pathway\n",
" SELECT\n",
" COUNT(DISTINCT gene_list_query.uniprot_id) as num_genes,\n",
" pathway.stable_id,\n",
" pathway.name\n",
"\n",
" FROM\n",
" gene_list_query\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
" -- filter for interactions with genes that match a reactome\n",
" -- physical entity\n",
" ON gene_list_query.uniprot_id = pe.uniprot_id\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n",
" -- link physical entities to pathways via intermediate table\n",
" ON pe.stable_id = pe2pathway.pe_stable_id\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n",
" -- link physical entities to pathways\n",
" ON pe2pathway.pathway_stable_id = pathway.stable_id\n",
"\n",
" WHERE\n",
" -- filter by stronger evidence: \"Traceable Author Statement\" \n",
" pe2pathway.evidence_code = 'TAS'\n",
"\n",
" GROUP BY pathway.stable_id, pathway.name\n",
" ORDER BY num_genes DESC\n",
"\"\"\".strip(\"\\n\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "v29oHYEukrzy",
"outputId": "975b1f58-969e-4676-d5b7-45ab0b78694c"
},
"source": [
"print(gene_pp_query)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" -- Table that maps pathways to the total number of interesting genes within\n",
" -- that pathway\n",
" SELECT\n",
" COUNT(DISTINCT gene_list_query.uniprot_id) as num_genes,\n",
" pathway.stable_id,\n",
" pathway.name\n",
"\n",
" FROM\n",
" gene_list_query\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
" -- filter for interactions with genes that match a reactome\n",
" -- physical entity\n",
" ON gene_list_query.uniprot_id = pe.uniprot_id\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n",
" -- link physical entities to pathways via intermediate table\n",
" ON pe.stable_id = pe2pathway.pe_stable_id\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n",
" -- link physical entities to pathways\n",
" ON pe2pathway.pathway_stable_id = pathway.stable_id\n",
"\n",
" WHERE\n",
" -- filter by stronger evidence: \"Traceable Author Statement\" \n",
" pe2pathway.evidence_code = 'TAS'\n",
"\n",
" GROUP BY pathway.stable_id, pathway.name\n",
" ORDER BY num_genes DESC\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_XXiAJZUlBxC"
},
"source": [
"Now we construct the same queries for genes that are NOT in the interesting gene list. These are prefixed with \"not_gene\". This query depends on the previous `gene_list_query` sub-query. "
]
},
{
"cell_type": "code",
"metadata": {
"id": "1KyR-_WqlVVn"
},
"source": [
"not_gene_list_query = \"\"\"\n",
" -- Table that contains a list of all genes that are NOT in the interest list\n",
" -- This query depends on the previous \"gene_list_query\" sub-query.\n",
" SELECT\n",
" DISTINCT pe.uniprot_id\n",
"\n",
" FROM `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
"\n",
" WHERE\n",
" pe.uniprot_id NOT IN (\n",
" SELECT uniprot_id FROM gene_list_query\n",
" )\n",
"\"\"\".strip(\"\\n\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "92MaTzsAn3es",
"outputId": "c50ae34a-1cbf-496e-86b4-ab804bc9c581"
},
"source": [
"print(not_gene_list_query)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" -- Table that contains a list of all genes that are NOT in the interest list\n",
" -- This query depends on the previous \"gene_list_query\" sub-query.\n",
" SELECT\n",
" DISTINCT pe.uniprot_id\n",
"\n",
" FROM `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
"\n",
" WHERE\n",
" pe.uniprot_id NOT IN (\n",
" SELECT uniprot_id FROM gene_list_query\n",
" )\n"
]
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "mI1SxkGtn-ow"
},
"source": [
"not_gene_pp_query = \"\"\"\n",
" -- Table that maps pathways to the number of proteins that are NOT drug\n",
" -- targets in that pathway.\n",
" SELECT\n",
" COUNT(DISTINCT not_gene_list_query.uniprot_id) AS num_not_genes,\n",
" pathway.stable_id,\n",
" pathway.name\n",
"\n",
" FROM not_gene_list_query\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
" ON not_gene_list_query.uniprot_id = pe.uniprot_id\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n",
" ON pe.stable_id = pe2pathway.pe_stable_id\n",
" \n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n",
" ON pe2pathway.pathway_stable_id = pathway.stable_id\n",
"\n",
" WHERE\n",
" -- filter by stronger evidence: \"Traceable Author Statement\" \n",
" pe2pathway.evidence_code = 'TAS'\n",
"\n",
" GROUP BY pathway.stable_id, pathway.name\n",
" ORDER BY num_not_genes DESC\n",
"\"\"\".strip(\"\\n\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "TgY33Z3ukByC",
"outputId": "e9143dce-28de-4f43-ce7a-032c79757486"
},
"source": [
"print(not_gene_pp_query)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" -- Table that maps pathways to the number of proteins that are NOT drug\n",
" -- targets in that pathway.\n",
" SELECT\n",
" COUNT(DISTINCT not_gene_list_query.uniprot_id) AS num_not_genes,\n",
" pathway.stable_id,\n",
" pathway.name\n",
"\n",
" FROM not_gene_list_query\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
" ON not_gene_list_query.uniprot_id = pe.uniprot_id\n",
"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n",
" ON pe.stable_id = pe2pathway.pe_stable_id\n",
" \n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n",
" ON pe2pathway.pathway_stable_id = pathway.stable_id\n",
"\n",
" WHERE\n",
" -- filter by stronger evidence: \"Traceable Author Statement\" \n",
" pe2pathway.evidence_code = 'TAS'\n",
"\n",
" GROUP BY pathway.stable_id, pathway.name\n",
" ORDER BY num_not_genes DESC\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Z2eN5_UfVgcX"
},
"source": [
"For convenience, we create a sub-query to get the counts of the number of genes that are in the list and the number of genes that are not in the list."
]
},
{
"cell_type": "code",
"metadata": {
"id": "VAYDIHsdWLFA"
},
"source": [
"gene_count_query = \"\"\"\n",
" -- Table that contains the counts of # of genes that are/are not targets\n",
" SELECT\n",
" gene_count,\n",
" not_gene_count,\n",
" gene_count + not_gene_count AS total_count\n",
"\n",
" FROM \n",
" (SELECT COUNT(*) AS gene_count FROM gene_list_query),\n",
" (SELECT COUNT(*) AS not_gene_count FROM not_gene_list_query)\n",
"\"\"\".strip(\"\\n\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "qrt8DcUraJm3"
},
"source": [
"Now we can create more interesting queries for the contingency matrices that contain the observed and expected values. \n",
"\n",
"First, the observed contingency table counts for each pathway:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "8Dc-wMSZakad"
},
"source": [
"observed_query = \"\"\"\n",
" -- Table with observed values per pathway in the contingency matrix\n",
" SELECT\n",
" gene_pp_query.num_genes AS in_gene_in_pathway,\n",
" not_gene_pp_query.num_not_genes AS not_gene_in_pathway,\n",
" gene_count_query.gene_count - gene_pp_query.num_genes AS in_gene_not_pathway,\n",
" gene_count_query.not_gene_count - not_gene_pp_query.num_not_genes AS not_gene_not_pathway,\n",
" gene_pp_query.stable_id,\n",
" gene_pp_query.name\n",
"\n",
" FROM \n",
" gene_pp_query,\n",
" gene_count_query\n",
"\n",
" INNER JOIN not_gene_pp_query\n",
" ON gene_pp_query.stable_id = not_gene_pp_query.stable_id\n",
"\"\"\".strip(\"\\n\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "STbl3Toxcq7B"
},
"source": [
"Then the observed row and column sums of the contingency table:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "4kXNQ4SRcmPK"
},
"source": [
"sum_query = \"\"\"\n",
" -- Table with summed observed values per pathway in the contingency matrix\n",
" SELECT\n",
" observed_query.in_gene_in_pathway + observed_query.not_gene_in_pathway AS pathway_total,\n",
" observed_query.in_gene_not_pathway + observed_query.not_gene_not_pathway AS not_pathway_total,\n",
" observed_query.in_gene_in_pathway + observed_query.in_gene_not_pathway AS gene_total,\n",
" observed_query.not_gene_in_pathway + observed_query.not_gene_not_pathway AS not_gene_total,\n",
" observed_query.stable_id,\n",
" observed_query.name\n",
"\n",
" FROM\n",
" observed_query\n",
"\"\"\".strip(\"\\n\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "6HiQlxAudQEk"
},
"source": [
"And the expected contingency table values for each pathway:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "V0Rj098QdWe6"
},
"source": [
"expected_query = \"\"\" \n",
" -- Table with the expected values per pathway in the contingency matrix\n",
" SELECT \n",
" sum_query.gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_in_gene_in_pathway,\n",
" sum_query.not_gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_not_gene_in_pathway,\n",
" sum_query.gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_in_gene_not_pathway,\n",
" sum_query.not_gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_not_gene_not_pathway,\n",
" sum_query.stable_id,\n",
" sum_query.name\n",
"\n",
" FROM \n",
" sum_query, gene_count_query\n",
"\"\"\".strip(\"\\n\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "wC57CDA6d_4H"
},
"source": [
"Finally, we can calculate the chi-squared statistic for each pathway:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "kxlrSyiTd_Ia"
},
"source": [
"chi_squared_query = \"\"\"\n",
" -- Table with the chi-squared statistic for each pathway\n",
" SELECT\n",
" -- Chi squared statistic with Yates' correction\n",
" POW(ABS(observed_query.in_gene_in_pathway - expected_query.exp_in_gene_in_pathway) - 0.5, 2) / expected_query.exp_in_gene_in_pathway \n",
" + POW(ABS(observed_query.not_gene_in_pathway - expected_query.exp_not_gene_in_pathway) - 0.5, 2) / expected_query.exp_not_gene_in_pathway\n",
" + POW(ABS(observed_query.in_gene_not_pathway - expected_query.exp_in_gene_not_pathway) - 0.5, 2) / expected_query.exp_in_gene_not_pathway\n",
" + POW(ABS(observed_query.not_gene_not_pathway - expected_query.exp_not_gene_not_pathway) - 0.5, 2) / expected_query.exp_not_gene_not_pathway\n",
" AS chi_squared_stat,\n",
" observed_query.stable_id,\n",
" observed_query.name\n",
"\n",
" FROM observed_query\n",
"\n",
" INNER JOIN expected_query\n",
" ON observed_query.stable_id = expected_query.stable_id\n",
"\"\"\".strip(\"\\n\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "lTRfNbLKuarY"
},
"source": [
"The final piece of the query optionally adds a filter that removes all pathways that are not at the lowest level of the hierarchy. This helps remove non-specific \"pathways\" such as the generic \"Disease\" pathway."
]
},
{
"cell_type": "code",
"metadata": {
"id": "IBpHKWxjvCSx"
},
"source": [
"lowest_level_filter = \"\"\"\n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n",
" ON chi_squared_query.stable_id = pathway.stable_id\n",
"\n",
" WHERE pathway.lowest_level = TRUE\n",
"\"\"\".strip(\"\\n\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "V_Qx1J2IvrBT",
"outputId": "04ee2a9a-a1df-416f-fef6-83cb0bc23671"
},
"source": [
"print(lowest_level_filter)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n",
" ON chi_squared_query.stable_id = pathway.stable_id\n",
"\n",
" WHERE pathway.lowest_level = TRUE\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QsJQyREafSZE"
},
"source": [
"Now we can combine all sub-queries to create the final query:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "dOXXXAmkfcnb"
},
"source": [
"final_query = \"\"\"\n",
" WITH\n",
" gene_list_query AS (\n",
" {gene_list_query}\n",
" ),\n",
"\n",
" gene_pp_query AS (\n",
" {gene_pp_query}\n",
" ),\n",
"\n",
" not_gene_list_query AS (\n",
" {not_gene_list_query}\n",
" ),\n",
"\n",
" not_gene_pp_query AS (\n",
" {not_gene_pp_query}\n",
" ),\n",
"\n",
" gene_count_query AS (\n",
" {gene_count_query}\n",
" ),\n",
"\n",
" observed_query AS (\n",
" {observed_query}\n",
" ),\n",
"\n",
" sum_query AS (\n",
" {sum_query}\n",
" ),\n",
"\n",
" expected_query AS (\n",
" {expected_query}\n",
" ),\n",
" \n",
" chi_squared_query AS (\n",
" {chi_squared_query}\n",
" )\n",
"\n",
" SELECT\n",
" observed_query.in_gene_in_pathway,\n",
" observed_query.in_gene_not_pathway,\n",
" observed_query.not_gene_in_pathway,\n",
" observed_query.not_gene_not_pathway,\n",
" chi_squared_query.chi_squared_stat,\n",
" chi_squared_query.stable_id,\n",
" chi_squared_query.name\n",
"\n",
" FROM chi_squared_query\n",
"\n",
" INNER JOIN observed_query\n",
" ON chi_squared_query.stable_id = observed_query.stable_id\n",
" {lowest_level_filter}\n",
" ORDER BY chi_squared_stat DESC\n",
"\"\"\".format(\n",
" # make final query a little easier to read by removing/adding some white space \n",
" gene_list_query=\"\\n \".join(gene_list_query.strip().splitlines()),\n",
" gene_pp_query=\"\\n \".join(gene_pp_query.strip().splitlines()),\n",
" not_gene_list_query=\"\\n \".join(not_gene_list_query.strip().splitlines()),\n",
" not_gene_pp_query=\"\\n \".join(not_gene_pp_query.strip().splitlines()),\n",
" gene_count_query=\"\\n \".join(gene_count_query.strip().splitlines()),\n",
" observed_query=\"\\n \".join(observed_query.strip().splitlines()),\n",
" sum_query=\"\\n \".join(sum_query.strip().splitlines()),\n",
" expected_query=\"\\n \".join(expected_query.strip().splitlines()),\n",
" chi_squared_query=\"\\n \".join(chi_squared_query.strip().splitlines()),\n",
" lowest_level_filter=(\n",
" \"\\n \"+\"\\n\".join(lowest_level_filter.strip().splitlines())+\"\\n\" if lowest_level else \"\"\n",
" )\n",
").strip(\"\\n\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "YEjfpM9Q4xsz"
},
"source": [
"## Display Final Query"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "MfitLyzmgq_P",
"outputId": "b554ac13-3c67-4639-9993-e1df53834ee5"
},
"source": [
"# print the formatted final query\n",
"print(final_query)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" WITH\n",
" gene_list_query AS (\n",
" -- Table that contains a list of all distinct genes that map to Reactome\n",
" -- physical entities \n",
" SELECT\n",
" DISTINCT pe.uniprot_id\n",
" \n",
" FROM\n",
" `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
" \n",
" WHERE\n",
" -- filter by pathways that are related to genes in list\n",
" pe.name IN ('RHOT1','MYO7A','ZBTB10','MATK','ST18','RPS23','GCNT1','DROSHA','NUAK1','CCPG1','PDGFD','KLRAP1','MTAP','RNF13','THBS1','MLX','FAP','TIMP3','PRSS1','SLC7A11','OLFML3','RPS20','MCM5','POLE','STEAP4','LRRC8D','WBP1L','ENTPD5','SYNE1','DPT','COPZ2','TRIO','PDPR')\n",
" ),\n",
"\n",
" gene_pp_query AS (\n",
" -- Table that maps pathways to the total number of interesting genes within\n",
" -- that pathway\n",
" SELECT\n",
" COUNT(DISTINCT gene_list_query.uniprot_id) as num_genes,\n",
" pathway.stable_id,\n",
" pathway.name\n",
" \n",
" FROM\n",
" gene_list_query\n",
" \n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
" -- filter for interactions with genes that match a reactome\n",
" -- physical entity\n",
" ON gene_list_query.uniprot_id = pe.uniprot_id\n",
" \n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n",
" -- link physical entities to pathways via intermediate table\n",
" ON pe.stable_id = pe2pathway.pe_stable_id\n",
" \n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n",
" -- link physical entities to pathways\n",
" ON pe2pathway.pathway_stable_id = pathway.stable_id\n",
" \n",
" WHERE\n",
" -- filter by stronger evidence: \"Traceable Author Statement\" \n",
" pe2pathway.evidence_code = 'TAS'\n",
" \n",
" GROUP BY pathway.stable_id, pathway.name\n",
" ORDER BY num_genes DESC\n",
" ),\n",
"\n",
" not_gene_list_query AS (\n",
" -- Table that contains a list of all genes that are NOT in the interest list\n",
" -- This query depends on the previous \"gene_list_query\" sub-query.\n",
" SELECT\n",
" DISTINCT pe.uniprot_id\n",
" \n",
" FROM `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
" \n",
" WHERE\n",
" pe.uniprot_id NOT IN (\n",
" SELECT uniprot_id FROM gene_list_query\n",
" )\n",
" ),\n",
"\n",
" not_gene_pp_query AS (\n",
" -- Table that maps pathways to the number of proteins that are NOT drug\n",
" -- targets in that pathway.\n",
" SELECT\n",
" COUNT(DISTINCT not_gene_list_query.uniprot_id) AS num_not_genes,\n",
" pathway.stable_id,\n",
" pathway.name\n",
" \n",
" FROM not_gene_list_query\n",
" \n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe\n",
" ON not_gene_list_query.uniprot_id = pe.uniprot_id\n",
" \n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway\n",
" ON pe.stable_id = pe2pathway.pe_stable_id\n",
" \n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n",
" ON pe2pathway.pathway_stable_id = pathway.stable_id\n",
" \n",
" WHERE\n",
" -- filter by stronger evidence: \"Traceable Author Statement\" \n",
" pe2pathway.evidence_code = 'TAS'\n",
" \n",
" GROUP BY pathway.stable_id, pathway.name\n",
" ORDER BY num_not_genes DESC\n",
" ),\n",
"\n",
" gene_count_query AS (\n",
" -- Table that contains the counts of # of genes that are/are not targets\n",
" SELECT\n",
" gene_count,\n",
" not_gene_count,\n",
" gene_count + not_gene_count AS total_count\n",
" \n",
" FROM \n",
" (SELECT COUNT(*) AS gene_count FROM gene_list_query),\n",
" (SELECT COUNT(*) AS not_gene_count FROM not_gene_list_query)\n",
" ),\n",
"\n",
" observed_query AS (\n",
" -- Table with observed values per pathway in the contingency matrix\n",
" SELECT\n",
" gene_pp_query.num_genes AS in_gene_in_pathway,\n",
" not_gene_pp_query.num_not_genes AS not_gene_in_pathway,\n",
" gene_count_query.gene_count - gene_pp_query.num_genes AS in_gene_not_pathway,\n",
" gene_count_query.not_gene_count - not_gene_pp_query.num_not_genes AS not_gene_not_pathway,\n",
" gene_pp_query.stable_id,\n",
" gene_pp_query.name\n",
" \n",
" FROM \n",
" gene_pp_query,\n",
" gene_count_query\n",
" \n",
" INNER JOIN not_gene_pp_query\n",
" ON gene_pp_query.stable_id = not_gene_pp_query.stable_id\n",
" ),\n",
"\n",
" sum_query AS (\n",
" -- Table with summed observed values per pathway in the contingency matrix\n",
" SELECT\n",
" observed_query.in_gene_in_pathway + observed_query.not_gene_in_pathway AS pathway_total,\n",
" observed_query.in_gene_not_pathway + observed_query.not_gene_not_pathway AS not_pathway_total,\n",
" observed_query.in_gene_in_pathway + observed_query.in_gene_not_pathway AS gene_total,\n",
" observed_query.not_gene_in_pathway + observed_query.not_gene_not_pathway AS not_gene_total,\n",
" observed_query.stable_id,\n",
" observed_query.name\n",
" \n",
" FROM\n",
" observed_query\n",
" ),\n",
"\n",
" expected_query AS (\n",
" -- Table with the expected values per pathway in the contingency matrix\n",
" SELECT \n",
" sum_query.gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_in_gene_in_pathway,\n",
" sum_query.not_gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_not_gene_in_pathway,\n",
" sum_query.gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_in_gene_not_pathway,\n",
" sum_query.not_gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_not_gene_not_pathway,\n",
" sum_query.stable_id,\n",
" sum_query.name\n",
" \n",
" FROM \n",
" sum_query, gene_count_query\n",
" ),\n",
" \n",
" chi_squared_query AS (\n",
" -- Table with the chi-squared statistic for each pathway\n",
" SELECT\n",
" -- Chi squared statistic with Yates' correction\n",
" POW(ABS(observed_query.in_gene_in_pathway - expected_query.exp_in_gene_in_pathway) - 0.5, 2) / expected_query.exp_in_gene_in_pathway \n",
" + POW(ABS(observed_query.not_gene_in_pathway - expected_query.exp_not_gene_in_pathway) - 0.5, 2) / expected_query.exp_not_gene_in_pathway\n",
" + POW(ABS(observed_query.in_gene_not_pathway - expected_query.exp_in_gene_not_pathway) - 0.5, 2) / expected_query.exp_in_gene_not_pathway\n",
" + POW(ABS(observed_query.not_gene_not_pathway - expected_query.exp_not_gene_not_pathway) - 0.5, 2) / expected_query.exp_not_gene_not_pathway\n",
" AS chi_squared_stat,\n",
" observed_query.stable_id,\n",
" observed_query.name\n",
" \n",
" FROM observed_query\n",
" \n",
" INNER JOIN expected_query\n",
" ON observed_query.stable_id = expected_query.stable_id\n",
" )\n",
"\n",
" SELECT\n",
" observed_query.in_gene_in_pathway,\n",
" observed_query.in_gene_not_pathway,\n",
" observed_query.not_gene_in_pathway,\n",
" observed_query.not_gene_not_pathway,\n",
" chi_squared_query.chi_squared_stat,\n",
" chi_squared_query.stable_id,\n",
" chi_squared_query.name\n",
"\n",
" FROM chi_squared_query\n",
"\n",
" INNER JOIN observed_query\n",
" ON chi_squared_query.stable_id = observed_query.stable_id\n",
" \n",
" INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway\n",
" ON chi_squared_query.stable_id = pathway.stable_id\n",
"\n",
" WHERE pathway.lowest_level = TRUE\n",
"\n",
" ORDER BY chi_squared_stat DESC\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Fjkbq4FR19in"
},
"source": [
"## Execute the Query\n",
"\n",
"Now execute the query to calculate a chi-squared statistic for each pathway:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Ua5oYRFY1aVu"
},
"source": [
"# run query and put results in data frame\n",
"chi_squared_pathways = client.query(final_query).result().to_dataframe()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "9Yt-GatY2awc",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "3c62ac76-7a84-4ae3-89a2-9ea7e35f2737"
},
"source": [
"# display the data frame\n",
"chi_squared_pathways"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" in_gene_in_pathway | \n",
" in_gene_not_pathway | \n",
" not_gene_in_pathway | \n",
" not_gene_not_pathway | \n",
" chi_squared_stat | \n",
" stable_id | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 2 | \n",
" 19 | \n",
" 31 | \n",
" 11114 | \n",
" 33.477023 | \n",
" R-HSA-68962 | \n",
" Activation of the pre-replicative complex | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 20 | \n",
" 3 | \n",
" 11142 | \n",
" 32.311989 | \n",
" R-HSA-1237112 | \n",
" Methionine salvage pathway | \n",
"
\n",
" \n",
" | 2 | \n",
" 1 | \n",
" 20 | \n",
" 3 | \n",
" 11142 | \n",
" 32.311989 | \n",
" R-HSA-9013425 | \n",
" RHOT1 GTPase cycle | \n",
"
\n",
" \n",
" | 3 | \n",
" 2 | \n",
" 19 | \n",
" 50 | \n",
" 11095 | \n",
" 20.236790 | \n",
" R-HSA-72695 | \n",
" Formation of the ternary complex, and subseque... | \n",
"
\n",
" \n",
" | 4 | \n",
" 1 | \n",
" 20 | \n",
" 6 | \n",
" 11139 | \n",
" 18.048197 | \n",
" R-HSA-163765 | \n",
" ChREBP activates metabolic gene expression | \n",
"
\n",
" \n",
" | 5 | \n",
" 2 | \n",
" 19 | \n",
" 57 | \n",
" 11088 | \n",
" 17.513505 | \n",
" R-HSA-72649 | \n",
" Translation initiation complex formation | \n",
"
\n",
" \n",
" | 6 | \n",
" 2 | \n",
" 19 | \n",
" 57 | \n",
" 11088 | \n",
" 17.513505 | \n",
" R-HSA-72702 | \n",
" Ribosomal scanning and start codon recognition | \n",
"
\n",
" \n",
" | 7 | \n",
" 1 | \n",
" 20 | \n",
" 7 | \n",
" 11138 | \n",
" 15.671798 | \n",
" R-HSA-8850843 | \n",
" Phosphate bond hydrolysis by NTPDase proteins | \n",
"
\n",
" \n",
" | 8 | \n",
" 1 | \n",
" 20 | \n",
" 7 | \n",
" 11138 | \n",
" 15.671798 | \n",
" R-HSA-68952 | \n",
" DNA replication initiation | \n",
"
\n",
" \n",
" | 9 | \n",
" 1 | \n",
" 20 | \n",
" 10 | \n",
" 11135 | \n",
" 11.137000 | \n",
" R-HSA-418885 | \n",
" DCC mediated attractive signaling | \n",
"
\n",
" \n",
" | 10 | \n",
" 2 | \n",
" 19 | \n",
" 88 | \n",
" 11057 | \n",
" 10.567006 | \n",
" R-HSA-192823 | \n",
" Viral mRNA Translation | \n",
"
\n",
" \n",
" | 11 | \n",
" 2 | \n",
" 19 | \n",
" 92 | \n",
" 11053 | \n",
" 10.006894 | \n",
" R-HSA-1799339 | \n",
" SRP-dependent cotranslational protein targetin... | \n",
"
\n",
" \n",
" | 12 | \n",
" 2 | \n",
" 19 | \n",
" 94 | \n",
" 11051 | \n",
" 9.744550 | \n",
" R-HSA-975956 | \n",
" Nonsense Mediated Decay (NMD) independent of t... | \n",
"
\n",
" \n",
" | 13 | \n",
" 2 | \n",
" 19 | \n",
" 100 | \n",
" 11045 | \n",
" 9.020030 | \n",
" R-HSA-72689 | \n",
" Formation of a pool of free 40S subunits | \n",
"
\n",
" \n",
" | 14 | \n",
" 2 | \n",
" 19 | \n",
" 110 | \n",
" 11035 | \n",
" 7.987388 | \n",
" R-HSA-156827 | \n",
" L13a-mediated translational silencing of Cerul... | \n",
"
\n",
" \n",
" | 15 | \n",
" 2 | \n",
" 19 | \n",
" 111 | \n",
" 11034 | \n",
" 7.894339 | \n",
" R-HSA-72706 | \n",
" GTP hydrolysis and joining of the 60S ribosoma... | \n",
"
\n",
" \n",
" | 16 | \n",
" 2 | \n",
" 19 | \n",
" 115 | \n",
" 11030 | \n",
" 7.538335 | \n",
" R-HSA-975957 | \n",
" Nonsense Mediated Decay (NMD) enhanced by the ... | \n",
"
\n",
" \n",
" | 17 | \n",
" 1 | \n",
" 20 | \n",
" 15 | \n",
" 11130 | \n",
" 7.362504 | \n",
" R-HSA-204174 | \n",
" Regulation of pyruvate dehydrogenase (PDH) com... | \n",
"
\n",
" \n",
" | 18 | \n",
" 2 | \n",
" 19 | \n",
" 119 | \n",
" 11026 | \n",
" 7.206312 | \n",
" R-HSA-114608 | \n",
" Platelet degranulation | \n",
"
\n",
" \n",
" | 19 | \n",
" 1 | \n",
" 20 | \n",
" 20 | \n",
" 11125 | \n",
" 5.389681 | \n",
" R-HSA-5651801 | \n",
" PCNA-Dependent Long Patch Base Excision Repair | \n",
"
\n",
" \n",
" | 20 | \n",
" 1 | \n",
" 20 | \n",
" 23 | \n",
" 11122 | \n",
" 4.602357 | \n",
" R-HSA-203927 | \n",
" MicroRNA (miRNA) biogenesis | \n",
"
\n",
" \n",
" | 21 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-9660826 | \n",
" Purinergic signaling in leishmaniasis infection | \n",
"
\n",
" \n",
" | 22 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-210991 | \n",
" Basigin interactions | \n",
"
\n",
" \n",
" | 23 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-5223345 | \n",
" Miscellaneous transport and binding events | \n",
"
\n",
" \n",
" | 24 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-5696397 | \n",
" Gap-filling DNA repair synthesis and ligation ... | \n",
"
\n",
" \n",
" | 25 | \n",
" 2 | \n",
" 19 | \n",
" 181 | \n",
" 10964 | \n",
" 3.953753 | \n",
" R-HSA-6791226 | \n",
" Major pathway of rRNA processing in the nucleo... | \n",
"
\n",
" \n",
" | 26 | \n",
" 1 | \n",
" 20 | \n",
" 29 | \n",
" 11116 | \n",
" 3.503357 | \n",
" R-HSA-110314 | \n",
" Recognition of DNA damage by PCNA-containing r... | \n",
"
\n",
" \n",
" | 27 | \n",
" 1 | \n",
" 20 | \n",
" 31 | \n",
" 11114 | \n",
" 3.229512 | \n",
" R-HSA-5656169 | \n",
" Termination of translesion DNA synthesis | \n",
"
\n",
" \n",
" | 28 | \n",
" 1 | \n",
" 20 | \n",
" 33 | \n",
" 11112 | \n",
" 2.988310 | \n",
" R-HSA-352230 | \n",
" Amino acid transport across the plasma membrane | \n",
"
\n",
" \n",
" | 29 | \n",
" 1 | \n",
" 20 | \n",
" 36 | \n",
" 11109 | \n",
" 2.676134 | \n",
" R-HSA-176187 | \n",
" Activation of ATR in response to replication s... | \n",
"
\n",
" \n",
" | 30 | \n",
" 1 | \n",
" 20 | \n",
" 36 | \n",
" 11109 | \n",
" 2.676134 | \n",
" R-HSA-8950505 | \n",
" Gene and protein expression by JAK-STAT signal... | \n",
"
\n",
" \n",
" | 31 | \n",
" 1 | \n",
" 20 | \n",
" 37 | \n",
" 11108 | \n",
" 2.583220 | \n",
" R-HSA-5083635 | \n",
" Defective B3GALTL causes Peters-plus syndrome ... | \n",
"
\n",
" \n",
" | 32 | \n",
" 1 | \n",
" 20 | \n",
" 38 | \n",
" 11107 | \n",
" 2.495164 | \n",
" R-HSA-5173214 | \n",
" O-glycosylation of TSR domain-containing proteins | \n",
"
\n",
" \n",
" | 33 | \n",
" 1 | \n",
" 20 | \n",
" 40 | \n",
" 11105 | \n",
" 2.332203 | \n",
" R-HSA-5696400 | \n",
" Dual Incision in GG-NER | \n",
"
\n",
" \n",
" | 34 | \n",
" 1 | \n",
" 20 | \n",
" 54 | \n",
" 11091 | \n",
" 1.530727 | \n",
" R-HSA-9013409 | \n",
" RHOJ GTPase cycle | \n",
"
\n",
" \n",
" | 35 | \n",
" 1 | \n",
" 20 | \n",
" 54 | \n",
" 11091 | \n",
" 1.530727 | \n",
" R-HSA-193648 | \n",
" NRAGE signals death through JNK | \n",
"
\n",
" \n",
" | 36 | \n",
" 1 | \n",
" 20 | \n",
" 63 | \n",
" 11082 | \n",
" 1.206546 | \n",
" R-HSA-6782210 | \n",
" Gap-filling DNA repair synthesis and ligation ... | \n",
"
\n",
" \n",
" | 37 | \n",
" 1 | \n",
" 20 | \n",
" 64 | \n",
" 11081 | \n",
" 1.176348 | \n",
" R-HSA-6782135 | \n",
" Dual incision in TC-NER | \n",
"
\n",
" \n",
" | 38 | \n",
" 1 | \n",
" 20 | \n",
" 65 | \n",
" 11080 | \n",
" 1.147121 | \n",
" R-HSA-8936459 | \n",
" RUNX1 regulates genes involved in megakaryocyt... | \n",
"
\n",
" \n",
" | 39 | \n",
" 1 | \n",
" 20 | \n",
" 70 | \n",
" 11075 | \n",
" 1.014107 | \n",
" R-HSA-68949 | \n",
" Orc1 removal from chromatin | \n",
"
\n",
" \n",
" | 40 | \n",
" 1 | \n",
" 20 | \n",
" 73 | \n",
" 11072 | \n",
" 0.943520 | \n",
" R-HSA-9013408 | \n",
" RHOG GTPase cycle | \n",
"
\n",
" \n",
" | 41 | \n",
" 1 | \n",
" 20 | \n",
" 79 | \n",
" 11066 | \n",
" 0.819465 | \n",
" R-HSA-416482 | \n",
" G alpha (12/13) signalling events | \n",
"
\n",
" \n",
" | 42 | \n",
" 1 | \n",
" 20 | \n",
" 80 | \n",
" 11065 | \n",
" 0.800735 | \n",
" R-HSA-216083 | \n",
" Integrin cell surface interactions | \n",
"
\n",
" \n",
" | 43 | \n",
" 1 | \n",
" 20 | \n",
" 87 | \n",
" 11058 | \n",
" 0.682709 | \n",
" R-HSA-9013404 | \n",
" RAC2 GTPase cycle | \n",
"
\n",
" \n",
" | 44 | \n",
" 1 | \n",
" 20 | \n",
" 91 | \n",
" 11054 | \n",
" 0.624208 | \n",
" R-HSA-6804756 | \n",
" Regulation of TP53 Activity through Phosphoryl... | \n",
"
\n",
" \n",
" | 45 | \n",
" 1 | \n",
" 20 | \n",
" 93 | \n",
" 11052 | \n",
" 0.597060 | \n",
" R-HSA-9013423 | \n",
" RAC3 GTPase cycle | \n",
"
\n",
" \n",
" | 46 | \n",
" 1 | \n",
" 20 | \n",
" 99 | \n",
" 11046 | \n",
" 0.523015 | \n",
" R-HSA-6811434 | \n",
" COPI-dependent Golgi-to-ER retrograde traffic | \n",
"
\n",
" \n",
" | 47 | \n",
" 1 | \n",
" 20 | \n",
" 101 | \n",
" 11044 | \n",
" 0.500559 | \n",
" R-HSA-6807878 | \n",
" COPI-mediated anterograde transport | \n",
"
\n",
" \n",
" | 48 | \n",
" 1 | \n",
" 20 | \n",
" 149 | \n",
" 10996 | \n",
" 0.170910 | \n",
" R-HSA-8980692 | \n",
" RHOA GTPase cycle | \n",
"
\n",
" \n",
" | 49 | \n",
" 1 | \n",
" 20 | \n",
" 158 | \n",
" 10987 | \n",
" 0.137271 | \n",
" R-HSA-9013148 | \n",
" CDC42 GTPase cycle | \n",
"
\n",
" \n",
" | 50 | \n",
" 1 | \n",
" 20 | \n",
" 184 | \n",
" 10961 | \n",
" 0.067711 | \n",
" R-HSA-9013149 | \n",
" RAC1 GTPase cycle | \n",
"
\n",
" \n",
" | 51 | \n",
" 1 | \n",
" 20 | \n",
" 204 | \n",
" 10941 | \n",
" 0.034678 | \n",
" R-HSA-5689880 | \n",
" Ub-specific processing proteases | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" in_gene_in_pathway ... name\n",
"0 2 ... Activation of the pre-replicative complex\n",
"1 1 ... Methionine salvage pathway\n",
"2 1 ... RHOT1 GTPase cycle\n",
"3 2 ... Formation of the ternary complex, and subseque...\n",
"4 1 ... ChREBP activates metabolic gene expression\n",
"5 2 ... Translation initiation complex formation\n",
"6 2 ... Ribosomal scanning and start codon recognition\n",
"7 1 ... Phosphate bond hydrolysis by NTPDase proteins\n",
"8 1 ... DNA replication initiation\n",
"9 1 ... DCC mediated attractive signaling\n",
"10 2 ... Viral mRNA Translation\n",
"11 2 ... SRP-dependent cotranslational protein targetin...\n",
"12 2 ... Nonsense Mediated Decay (NMD) independent of t...\n",
"13 2 ... Formation of a pool of free 40S subunits\n",
"14 2 ... L13a-mediated translational silencing of Cerul...\n",
"15 2 ... GTP hydrolysis and joining of the 60S ribosoma...\n",
"16 2 ... Nonsense Mediated Decay (NMD) enhanced by the ...\n",
"17 1 ... Regulation of pyruvate dehydrogenase (PDH) com...\n",
"18 2 ... Platelet degranulation \n",
"19 1 ... PCNA-Dependent Long Patch Base Excision Repair\n",
"20 1 ... MicroRNA (miRNA) biogenesis\n",
"21 1 ... Purinergic signaling in leishmaniasis infection\n",
"22 1 ... Basigin interactions\n",
"23 1 ... Miscellaneous transport and binding events\n",
"24 1 ... Gap-filling DNA repair synthesis and ligation ...\n",
"25 2 ... Major pathway of rRNA processing in the nucleo...\n",
"26 1 ... Recognition of DNA damage by PCNA-containing r...\n",
"27 1 ... Termination of translesion DNA synthesis\n",
"28 1 ... Amino acid transport across the plasma membrane\n",
"29 1 ... Activation of ATR in response to replication s...\n",
"30 1 ... Gene and protein expression by JAK-STAT signal...\n",
"31 1 ... Defective B3GALTL causes Peters-plus syndrome ...\n",
"32 1 ... O-glycosylation of TSR domain-containing proteins\n",
"33 1 ... Dual Incision in GG-NER\n",
"34 1 ... RHOJ GTPase cycle\n",
"35 1 ... NRAGE signals death through JNK\n",
"36 1 ... Gap-filling DNA repair synthesis and ligation ...\n",
"37 1 ... Dual incision in TC-NER\n",
"38 1 ... RUNX1 regulates genes involved in megakaryocyt...\n",
"39 1 ... Orc1 removal from chromatin\n",
"40 1 ... RHOG GTPase cycle\n",
"41 1 ... G alpha (12/13) signalling events\n",
"42 1 ... Integrin cell surface interactions\n",
"43 1 ... RAC2 GTPase cycle\n",
"44 1 ... Regulation of TP53 Activity through Phosphoryl...\n",
"45 1 ... RAC3 GTPase cycle\n",
"46 1 ... COPI-dependent Golgi-to-ER retrograde traffic\n",
"47 1 ... COPI-mediated anterograde transport\n",
"48 1 ... RHOA GTPase cycle\n",
"49 1 ... CDC42 GTPase cycle\n",
"50 1 ... RAC1 GTPase cycle\n",
"51 1 ... Ub-specific processing proteases\n",
"\n",
"[52 rows x 7 columns]"
]
},
"metadata": {},
"execution_count": 29
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qM42ztRJDgbM"
},
"source": [
"## Calculate P-Values\n",
"\n",
"BigQuery does not have statistical functions to calculate p-values, so we use the SciPy stats library and update the data frame with a new p-value column:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Beim_ePN2cs7",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "6f95818a-7c14-451a-8549-eed4e9598983"
},
"source": [
"chi_squared_pathways['p_value'] = 1-stats.chi2.cdf(chi_squared_pathways['chi_squared_stat'], 1)\n",
"chi_squared_pathways"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" in_gene_in_pathway | \n",
" in_gene_not_pathway | \n",
" not_gene_in_pathway | \n",
" not_gene_not_pathway | \n",
" chi_squared_stat | \n",
" stable_id | \n",
" name | \n",
" p_value | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 2 | \n",
" 19 | \n",
" 31 | \n",
" 11114 | \n",
" 33.477023 | \n",
" R-HSA-68962 | \n",
" Activation of the pre-replicative complex | \n",
" 7.211088e-09 | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 20 | \n",
" 3 | \n",
" 11142 | \n",
" 32.311989 | \n",
" R-HSA-1237112 | \n",
" Methionine salvage pathway | \n",
" 1.313007e-08 | \n",
"
\n",
" \n",
" | 2 | \n",
" 1 | \n",
" 20 | \n",
" 3 | \n",
" 11142 | \n",
" 32.311989 | \n",
" R-HSA-9013425 | \n",
" RHOT1 GTPase cycle | \n",
" 1.313007e-08 | \n",
"
\n",
" \n",
" | 3 | \n",
" 2 | \n",
" 19 | \n",
" 50 | \n",
" 11095 | \n",
" 20.236790 | \n",
" R-HSA-72695 | \n",
" Formation of the ternary complex, and subseque... | \n",
" 6.842431e-06 | \n",
"
\n",
" \n",
" | 4 | \n",
" 1 | \n",
" 20 | \n",
" 6 | \n",
" 11139 | \n",
" 18.048197 | \n",
" R-HSA-163765 | \n",
" ChREBP activates metabolic gene expression | \n",
" 2.153825e-05 | \n",
"
\n",
" \n",
" | 5 | \n",
" 2 | \n",
" 19 | \n",
" 57 | \n",
" 11088 | \n",
" 17.513505 | \n",
" R-HSA-72649 | \n",
" Translation initiation complex formation | \n",
" 2.852741e-05 | \n",
"
\n",
" \n",
" | 6 | \n",
" 2 | \n",
" 19 | \n",
" 57 | \n",
" 11088 | \n",
" 17.513505 | \n",
" R-HSA-72702 | \n",
" Ribosomal scanning and start codon recognition | \n",
" 2.852741e-05 | \n",
"
\n",
" \n",
" | 7 | \n",
" 1 | \n",
" 20 | \n",
" 7 | \n",
" 11138 | \n",
" 15.671798 | \n",
" R-HSA-8850843 | \n",
" Phosphate bond hydrolysis by NTPDase proteins | \n",
" 7.533920e-05 | \n",
"
\n",
" \n",
" | 8 | \n",
" 1 | \n",
" 20 | \n",
" 7 | \n",
" 11138 | \n",
" 15.671798 | \n",
" R-HSA-68952 | \n",
" DNA replication initiation | \n",
" 7.533920e-05 | \n",
"
\n",
" \n",
" | 9 | \n",
" 1 | \n",
" 20 | \n",
" 10 | \n",
" 11135 | \n",
" 11.137000 | \n",
" R-HSA-418885 | \n",
" DCC mediated attractive signaling | \n",
" 8.462266e-04 | \n",
"
\n",
" \n",
" | 10 | \n",
" 2 | \n",
" 19 | \n",
" 88 | \n",
" 11057 | \n",
" 10.567006 | \n",
" R-HSA-192823 | \n",
" Viral mRNA Translation | \n",
" 1.151240e-03 | \n",
"
\n",
" \n",
" | 11 | \n",
" 2 | \n",
" 19 | \n",
" 92 | \n",
" 11053 | \n",
" 10.006894 | \n",
" R-HSA-1799339 | \n",
" SRP-dependent cotranslational protein targetin... | \n",
" 1.559553e-03 | \n",
"
\n",
" \n",
" | 12 | \n",
" 2 | \n",
" 19 | \n",
" 94 | \n",
" 11051 | \n",
" 9.744550 | \n",
" R-HSA-975956 | \n",
" Nonsense Mediated Decay (NMD) independent of t... | \n",
" 1.798552e-03 | \n",
"
\n",
" \n",
" | 13 | \n",
" 2 | \n",
" 19 | \n",
" 100 | \n",
" 11045 | \n",
" 9.020030 | \n",
" R-HSA-72689 | \n",
" Formation of a pool of free 40S subunits | \n",
" 2.670370e-03 | \n",
"
\n",
" \n",
" | 14 | \n",
" 2 | \n",
" 19 | \n",
" 110 | \n",
" 11035 | \n",
" 7.987388 | \n",
" R-HSA-156827 | \n",
" L13a-mediated translational silencing of Cerul... | \n",
" 4.710432e-03 | \n",
"
\n",
" \n",
" | 15 | \n",
" 2 | \n",
" 19 | \n",
" 111 | \n",
" 11034 | \n",
" 7.894339 | \n",
" R-HSA-72706 | \n",
" GTP hydrolysis and joining of the 60S ribosoma... | \n",
" 4.958976e-03 | \n",
"
\n",
" \n",
" | 16 | \n",
" 2 | \n",
" 19 | \n",
" 115 | \n",
" 11030 | \n",
" 7.538335 | \n",
" R-HSA-975957 | \n",
" Nonsense Mediated Decay (NMD) enhanced by the ... | \n",
" 6.039983e-03 | \n",
"
\n",
" \n",
" | 17 | \n",
" 1 | \n",
" 20 | \n",
" 15 | \n",
" 11130 | \n",
" 7.362504 | \n",
" R-HSA-204174 | \n",
" Regulation of pyruvate dehydrogenase (PDH) com... | \n",
" 6.659799e-03 | \n",
"
\n",
" \n",
" | 18 | \n",
" 2 | \n",
" 19 | \n",
" 119 | \n",
" 11026 | \n",
" 7.206312 | \n",
" R-HSA-114608 | \n",
" Platelet degranulation | \n",
" 7.264762e-03 | \n",
"
\n",
" \n",
" | 19 | \n",
" 1 | \n",
" 20 | \n",
" 20 | \n",
" 11125 | \n",
" 5.389681 | \n",
" R-HSA-5651801 | \n",
" PCNA-Dependent Long Patch Base Excision Repair | \n",
" 2.025618e-02 | \n",
"
\n",
" \n",
" | 20 | \n",
" 1 | \n",
" 20 | \n",
" 23 | \n",
" 11122 | \n",
" 4.602357 | \n",
" R-HSA-203927 | \n",
" MicroRNA (miRNA) biogenesis | \n",
" 3.192804e-02 | \n",
"
\n",
" \n",
" | 21 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-9660826 | \n",
" Purinergic signaling in leishmaniasis infection | \n",
" 3.631620e-02 | \n",
"
\n",
" \n",
" | 22 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-210991 | \n",
" Basigin interactions | \n",
" 3.631620e-02 | \n",
"
\n",
" \n",
" | 23 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-5223345 | \n",
" Miscellaneous transport and binding events | \n",
" 3.631620e-02 | \n",
"
\n",
" \n",
" | 24 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-5696397 | \n",
" Gap-filling DNA repair synthesis and ligation ... | \n",
" 3.631620e-02 | \n",
"
\n",
" \n",
" | 25 | \n",
" 2 | \n",
" 19 | \n",
" 181 | \n",
" 10964 | \n",
" 3.953753 | \n",
" R-HSA-6791226 | \n",
" Major pathway of rRNA processing in the nucleo... | \n",
" 4.676696e-02 | \n",
"
\n",
" \n",
" | 26 | \n",
" 1 | \n",
" 20 | \n",
" 29 | \n",
" 11116 | \n",
" 3.503357 | \n",
" R-HSA-110314 | \n",
" Recognition of DNA damage by PCNA-containing r... | \n",
" 6.124455e-02 | \n",
"
\n",
" \n",
" | 27 | \n",
" 1 | \n",
" 20 | \n",
" 31 | \n",
" 11114 | \n",
" 3.229512 | \n",
" R-HSA-5656169 | \n",
" Termination of translesion DNA synthesis | \n",
" 7.232223e-02 | \n",
"
\n",
" \n",
" | 28 | \n",
" 1 | \n",
" 20 | \n",
" 33 | \n",
" 11112 | \n",
" 2.988310 | \n",
" R-HSA-352230 | \n",
" Amino acid transport across the plasma membrane | \n",
" 8.386764e-02 | \n",
"
\n",
" \n",
" | 29 | \n",
" 1 | \n",
" 20 | \n",
" 36 | \n",
" 11109 | \n",
" 2.676134 | \n",
" R-HSA-176187 | \n",
" Activation of ATR in response to replication s... | \n",
" 1.018627e-01 | \n",
"
\n",
" \n",
" | 30 | \n",
" 1 | \n",
" 20 | \n",
" 36 | \n",
" 11109 | \n",
" 2.676134 | \n",
" R-HSA-8950505 | \n",
" Gene and protein expression by JAK-STAT signal... | \n",
" 1.018627e-01 | \n",
"
\n",
" \n",
" | 31 | \n",
" 1 | \n",
" 20 | \n",
" 37 | \n",
" 11108 | \n",
" 2.583220 | \n",
" R-HSA-5083635 | \n",
" Defective B3GALTL causes Peters-plus syndrome ... | \n",
" 1.080017e-01 | \n",
"
\n",
" \n",
" | 32 | \n",
" 1 | \n",
" 20 | \n",
" 38 | \n",
" 11107 | \n",
" 2.495164 | \n",
" R-HSA-5173214 | \n",
" O-glycosylation of TSR domain-containing proteins | \n",
" 1.141965e-01 | \n",
"
\n",
" \n",
" | 33 | \n",
" 1 | \n",
" 20 | \n",
" 40 | \n",
" 11105 | \n",
" 2.332203 | \n",
" R-HSA-5696400 | \n",
" Dual Incision in GG-NER | \n",
" 1.267224e-01 | \n",
"
\n",
" \n",
" | 34 | \n",
" 1 | \n",
" 20 | \n",
" 54 | \n",
" 11091 | \n",
" 1.530727 | \n",
" R-HSA-9013409 | \n",
" RHOJ GTPase cycle | \n",
" 2.160034e-01 | \n",
"
\n",
" \n",
" | 35 | \n",
" 1 | \n",
" 20 | \n",
" 54 | \n",
" 11091 | \n",
" 1.530727 | \n",
" R-HSA-193648 | \n",
" NRAGE signals death through JNK | \n",
" 2.160034e-01 | \n",
"
\n",
" \n",
" | 36 | \n",
" 1 | \n",
" 20 | \n",
" 63 | \n",
" 11082 | \n",
" 1.206546 | \n",
" R-HSA-6782210 | \n",
" Gap-filling DNA repair synthesis and ligation ... | \n",
" 2.720173e-01 | \n",
"
\n",
" \n",
" | 37 | \n",
" 1 | \n",
" 20 | \n",
" 64 | \n",
" 11081 | \n",
" 1.176348 | \n",
" R-HSA-6782135 | \n",
" Dual incision in TC-NER | \n",
" 2.781007e-01 | \n",
"
\n",
" \n",
" | 38 | \n",
" 1 | \n",
" 20 | \n",
" 65 | \n",
" 11080 | \n",
" 1.147121 | \n",
" R-HSA-8936459 | \n",
" RUNX1 regulates genes involved in megakaryocyt... | \n",
" 2.841526e-01 | \n",
"
\n",
" \n",
" | 39 | \n",
" 1 | \n",
" 20 | \n",
" 70 | \n",
" 11075 | \n",
" 1.014107 | \n",
" R-HSA-68949 | \n",
" Orc1 removal from chromatin | \n",
" 3.139209e-01 | \n",
"
\n",
" \n",
" | 40 | \n",
" 1 | \n",
" 20 | \n",
" 73 | \n",
" 11072 | \n",
" 0.943520 | \n",
" R-HSA-9013408 | \n",
" RHOG GTPase cycle | \n",
" 3.313742e-01 | \n",
"
\n",
" \n",
" | 41 | \n",
" 1 | \n",
" 20 | \n",
" 79 | \n",
" 11066 | \n",
" 0.819465 | \n",
" R-HSA-416482 | \n",
" G alpha (12/13) signalling events | \n",
" 3.653366e-01 | \n",
"
\n",
" \n",
" | 42 | \n",
" 1 | \n",
" 20 | \n",
" 80 | \n",
" 11065 | \n",
" 0.800735 | \n",
" R-HSA-216083 | \n",
" Integrin cell surface interactions | \n",
" 3.708738e-01 | \n",
"
\n",
" \n",
" | 43 | \n",
" 1 | \n",
" 20 | \n",
" 87 | \n",
" 11058 | \n",
" 0.682709 | \n",
" R-HSA-9013404 | \n",
" RAC2 GTPase cycle | \n",
" 4.086556e-01 | \n",
"
\n",
" \n",
" | 44 | \n",
" 1 | \n",
" 20 | \n",
" 91 | \n",
" 11054 | \n",
" 0.624208 | \n",
" R-HSA-6804756 | \n",
" Regulation of TP53 Activity through Phosphoryl... | \n",
" 4.294877e-01 | \n",
"
\n",
" \n",
" | 45 | \n",
" 1 | \n",
" 20 | \n",
" 93 | \n",
" 11052 | \n",
" 0.597060 | \n",
" R-HSA-9013423 | \n",
" RAC3 GTPase cycle | \n",
" 4.397019e-01 | \n",
"
\n",
" \n",
" | 46 | \n",
" 1 | \n",
" 20 | \n",
" 99 | \n",
" 11046 | \n",
" 0.523015 | \n",
" R-HSA-6811434 | \n",
" COPI-dependent Golgi-to-ER retrograde traffic | \n",
" 4.695582e-01 | \n",
"
\n",
" \n",
" | 47 | \n",
" 1 | \n",
" 20 | \n",
" 101 | \n",
" 11044 | \n",
" 0.500559 | \n",
" R-HSA-6807878 | \n",
" COPI-mediated anterograde transport | \n",
" 4.792546e-01 | \n",
"
\n",
" \n",
" | 48 | \n",
" 1 | \n",
" 20 | \n",
" 149 | \n",
" 10996 | \n",
" 0.170910 | \n",
" R-HSA-8980692 | \n",
" RHOA GTPase cycle | \n",
" 6.793044e-01 | \n",
"
\n",
" \n",
" | 49 | \n",
" 1 | \n",
" 20 | \n",
" 158 | \n",
" 10987 | \n",
" 0.137271 | \n",
" R-HSA-9013148 | \n",
" CDC42 GTPase cycle | \n",
" 7.110095e-01 | \n",
"
\n",
" \n",
" | 50 | \n",
" 1 | \n",
" 20 | \n",
" 184 | \n",
" 10961 | \n",
" 0.067711 | \n",
" R-HSA-9013149 | \n",
" RAC1 GTPase cycle | \n",
" 7.946990e-01 | \n",
"
\n",
" \n",
" | 51 | \n",
" 1 | \n",
" 20 | \n",
" 204 | \n",
" 10941 | \n",
" 0.034678 | \n",
" R-HSA-5689880 | \n",
" Ub-specific processing proteases | \n",
" 8.522717e-01 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" in_gene_in_pathway ... p_value\n",
"0 2 ... 7.211088e-09\n",
"1 1 ... 1.313007e-08\n",
"2 1 ... 1.313007e-08\n",
"3 2 ... 6.842431e-06\n",
"4 1 ... 2.153825e-05\n",
"5 2 ... 2.852741e-05\n",
"6 2 ... 2.852741e-05\n",
"7 1 ... 7.533920e-05\n",
"8 1 ... 7.533920e-05\n",
"9 1 ... 8.462266e-04\n",
"10 2 ... 1.151240e-03\n",
"11 2 ... 1.559553e-03\n",
"12 2 ... 1.798552e-03\n",
"13 2 ... 2.670370e-03\n",
"14 2 ... 4.710432e-03\n",
"15 2 ... 4.958976e-03\n",
"16 2 ... 6.039983e-03\n",
"17 1 ... 6.659799e-03\n",
"18 2 ... 7.264762e-03\n",
"19 1 ... 2.025618e-02\n",
"20 1 ... 3.192804e-02\n",
"21 1 ... 3.631620e-02\n",
"22 1 ... 3.631620e-02\n",
"23 1 ... 3.631620e-02\n",
"24 1 ... 3.631620e-02\n",
"25 2 ... 4.676696e-02\n",
"26 1 ... 6.124455e-02\n",
"27 1 ... 7.232223e-02\n",
"28 1 ... 8.386764e-02\n",
"29 1 ... 1.018627e-01\n",
"30 1 ... 1.018627e-01\n",
"31 1 ... 1.080017e-01\n",
"32 1 ... 1.141965e-01\n",
"33 1 ... 1.267224e-01\n",
"34 1 ... 2.160034e-01\n",
"35 1 ... 2.160034e-01\n",
"36 1 ... 2.720173e-01\n",
"37 1 ... 2.781007e-01\n",
"38 1 ... 2.841526e-01\n",
"39 1 ... 3.139209e-01\n",
"40 1 ... 3.313742e-01\n",
"41 1 ... 3.653366e-01\n",
"42 1 ... 3.708738e-01\n",
"43 1 ... 4.086556e-01\n",
"44 1 ... 4.294877e-01\n",
"45 1 ... 4.397019e-01\n",
"46 1 ... 4.695582e-01\n",
"47 1 ... 4.792546e-01\n",
"48 1 ... 6.793044e-01\n",
"49 1 ... 7.110095e-01\n",
"50 1 ... 7.946990e-01\n",
"51 1 ... 8.522717e-01\n",
"\n",
"[52 rows x 8 columns]"
]
},
"metadata": {},
"execution_count": 30
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TqPYE1JZDq5L"
},
"source": [
"## Adjust for Multiple Testing\n",
"\n",
"Since we're testing multiple pathways, we need to adjust the p-value threshold for significance accordingly. The number of pathways tested depends on whether or not we're considering all pathways, or just the lowest level pathways. That count can be obtained with the following query."
]
},
{
"cell_type": "code",
"metadata": {
"id": "7KziIxfED8kB"
},
"source": [
"# run query and put results in data frame\n",
"num_pathways_result = client.query((\"\"\"\n",
" SELECT\n",
" COUNT (*) AS num_pathways\n",
" FROM\n",
" `isb-cgc-bq.reactome_versioned.pathway_v77` as pathway\n",
" {lowest_level_filter}\n",
"\"\"\").format(\n",
" lowest_level_filter=(\"WHERE lowest_level = TRUE\" if lowest_level else \"\")\n",
")).result().to_dataframe()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 80
},
"id": "E_7ncwlOEmuG",
"outputId": "2a300402-e3cc-4cdf-f507-536a4ef38fdd"
},
"source": [
"# display data frame\n",
"num_pathways_result"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" num_pathways | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1773 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" num_pathways\n",
"0 1773"
]
},
"metadata": {},
"execution_count": 32
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7B9__TlA4M9u",
"outputId": "0ef921bf-8224-4b14-ce6a-d3418ccaea57"
},
"source": [
"# adjust significance threshold for multiple testing, using a p-value of 0.01\n",
"num_pathways = num_pathways_result['num_pathways'][0]\n",
"significance_threshold = 0.01/num_pathways\n",
"print('Significance Threshold: {}'.format(significance_threshold))\n",
"\n",
"# find all pathways that meet the significance criterion after adjusting for\n",
"# multiple testing\n",
"significant_pathway_index = chi_squared_pathways['p_value']\n",
"\n",
"\n",
" \n",
" \n",
" | \n",
" in_gene_in_pathway | \n",
" in_gene_not_pathway | \n",
" not_gene_in_pathway | \n",
" not_gene_not_pathway | \n",
" chi_squared_stat | \n",
" stable_id | \n",
" name | \n",
" p_value | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 2 | \n",
" 19 | \n",
" 31 | \n",
" 11114 | \n",
" 33.477023 | \n",
" R-HSA-68962 | \n",
" Activation of the pre-replicative complex | \n",
" 7.211088e-09 | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 20 | \n",
" 3 | \n",
" 11142 | \n",
" 32.311989 | \n",
" R-HSA-1237112 | \n",
" Methionine salvage pathway | \n",
" 1.313007e-08 | \n",
"
\n",
" \n",
" | 2 | \n",
" 1 | \n",
" 20 | \n",
" 3 | \n",
" 11142 | \n",
" 32.311989 | \n",
" R-HSA-9013425 | \n",
" RHOT1 GTPase cycle | \n",
" 1.313007e-08 | \n",
"
\n",
" \n",
"
\n",
""
],
"text/plain": [
" in_gene_in_pathway ... p_value\n",
"0 2 ... 7.211088e-09\n",
"1 1 ... 1.313007e-08\n",
"2 1 ... 1.313007e-08\n",
"\n",
"[3 rows x 8 columns]"
]
},
"metadata": {},
"execution_count": 34
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "if9v8H2T12cp"
},
"source": [
"The results of this analysis suggest that at least three pathways may be related to the genes identified. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "s31t8Ed-7cfK"
},
"source": [
"## Verify Results by Comparing to SciPy Chi-Squared Function\n",
"\n",
"We verify these BigQuery results by calculating the same chi-squared statistic using the SciPy package. Comparing the SciPy p-values and statistics to those of the BigQuery-derived results confirms that they are identical."
]
},
{
"cell_type": "code",
"metadata": {
"id": "-BPVmeMF1CGO"
},
"source": [
"# extract observed values from bigquery result\n",
"observed = chi_squared_pathways[[\n",
" 'in_gene_in_pathway',\n",
" 'in_gene_not_pathway',\n",
" 'not_gene_in_pathway',\n",
" 'not_gene_not_pathway'\n",
"]]\n",
"\n",
"# calculate the chi-squared statistic using the scipy stats package \n",
"chi2_stat = []\n",
"chi2_pvalue = []\n",
"for index, row in observed.iterrows():\n",
" stat, pvalue, _, _ = stats.chi2_contingency(\n",
" np.reshape(np.matrix(row), (2,2)), correction=True\n",
" )\n",
" chi2_stat.append(stat)\n",
" chi2_pvalue.append(pvalue)\n",
"\n",
"# add columns to the original data frame\n",
"chi_squared_pathways['scipy_stat'] = chi2_stat\n",
"chi_squared_pathways['scipy_p_value'] = chi2_pvalue"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "WWgNFm6w36s9",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "612d8812-294b-43a4-98e8-dad78af21f40"
},
"source": [
"# display the updated data frame\n",
"chi_squared_pathways"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" in_gene_in_pathway | \n",
" in_gene_not_pathway | \n",
" not_gene_in_pathway | \n",
" not_gene_not_pathway | \n",
" chi_squared_stat | \n",
" stable_id | \n",
" name | \n",
" p_value | \n",
" scipy_stat | \n",
" scipy_p_value | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 2 | \n",
" 19 | \n",
" 31 | \n",
" 11114 | \n",
" 33.477023 | \n",
" R-HSA-68962 | \n",
" Activation of the pre-replicative complex | \n",
" 7.211088e-09 | \n",
" 33.477023 | \n",
" 7.211088e-09 | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 20 | \n",
" 3 | \n",
" 11142 | \n",
" 32.311989 | \n",
" R-HSA-1237112 | \n",
" Methionine salvage pathway | \n",
" 1.313007e-08 | \n",
" 32.311989 | \n",
" 1.313007e-08 | \n",
"
\n",
" \n",
" | 2 | \n",
" 1 | \n",
" 20 | \n",
" 3 | \n",
" 11142 | \n",
" 32.311989 | \n",
" R-HSA-9013425 | \n",
" RHOT1 GTPase cycle | \n",
" 1.313007e-08 | \n",
" 32.311989 | \n",
" 1.313007e-08 | \n",
"
\n",
" \n",
" | 3 | \n",
" 2 | \n",
" 19 | \n",
" 50 | \n",
" 11095 | \n",
" 20.236790 | \n",
" R-HSA-72695 | \n",
" Formation of the ternary complex, and subseque... | \n",
" 6.842431e-06 | \n",
" 20.236790 | \n",
" 6.842431e-06 | \n",
"
\n",
" \n",
" | 4 | \n",
" 1 | \n",
" 20 | \n",
" 6 | \n",
" 11139 | \n",
" 18.048197 | \n",
" R-HSA-163765 | \n",
" ChREBP activates metabolic gene expression | \n",
" 2.153825e-05 | \n",
" 18.048197 | \n",
" 2.153825e-05 | \n",
"
\n",
" \n",
" | 5 | \n",
" 2 | \n",
" 19 | \n",
" 57 | \n",
" 11088 | \n",
" 17.513505 | \n",
" R-HSA-72649 | \n",
" Translation initiation complex formation | \n",
" 2.852741e-05 | \n",
" 17.513505 | \n",
" 2.852741e-05 | \n",
"
\n",
" \n",
" | 6 | \n",
" 2 | \n",
" 19 | \n",
" 57 | \n",
" 11088 | \n",
" 17.513505 | \n",
" R-HSA-72702 | \n",
" Ribosomal scanning and start codon recognition | \n",
" 2.852741e-05 | \n",
" 17.513505 | \n",
" 2.852741e-05 | \n",
"
\n",
" \n",
" | 7 | \n",
" 1 | \n",
" 20 | \n",
" 7 | \n",
" 11138 | \n",
" 15.671798 | \n",
" R-HSA-8850843 | \n",
" Phosphate bond hydrolysis by NTPDase proteins | \n",
" 7.533920e-05 | \n",
" 15.671798 | \n",
" 7.533920e-05 | \n",
"
\n",
" \n",
" | 8 | \n",
" 1 | \n",
" 20 | \n",
" 7 | \n",
" 11138 | \n",
" 15.671798 | \n",
" R-HSA-68952 | \n",
" DNA replication initiation | \n",
" 7.533920e-05 | \n",
" 15.671798 | \n",
" 7.533920e-05 | \n",
"
\n",
" \n",
" | 9 | \n",
" 1 | \n",
" 20 | \n",
" 10 | \n",
" 11135 | \n",
" 11.137000 | \n",
" R-HSA-418885 | \n",
" DCC mediated attractive signaling | \n",
" 8.462266e-04 | \n",
" 11.137000 | \n",
" 8.462266e-04 | \n",
"
\n",
" \n",
" | 10 | \n",
" 2 | \n",
" 19 | \n",
" 88 | \n",
" 11057 | \n",
" 10.567006 | \n",
" R-HSA-192823 | \n",
" Viral mRNA Translation | \n",
" 1.151240e-03 | \n",
" 10.567006 | \n",
" 1.151240e-03 | \n",
"
\n",
" \n",
" | 11 | \n",
" 2 | \n",
" 19 | \n",
" 92 | \n",
" 11053 | \n",
" 10.006894 | \n",
" R-HSA-1799339 | \n",
" SRP-dependent cotranslational protein targetin... | \n",
" 1.559553e-03 | \n",
" 10.006894 | \n",
" 1.559553e-03 | \n",
"
\n",
" \n",
" | 12 | \n",
" 2 | \n",
" 19 | \n",
" 94 | \n",
" 11051 | \n",
" 9.744550 | \n",
" R-HSA-975956 | \n",
" Nonsense Mediated Decay (NMD) independent of t... | \n",
" 1.798552e-03 | \n",
" 9.744550 | \n",
" 1.798552e-03 | \n",
"
\n",
" \n",
" | 13 | \n",
" 2 | \n",
" 19 | \n",
" 100 | \n",
" 11045 | \n",
" 9.020030 | \n",
" R-HSA-72689 | \n",
" Formation of a pool of free 40S subunits | \n",
" 2.670370e-03 | \n",
" 9.020030 | \n",
" 2.670370e-03 | \n",
"
\n",
" \n",
" | 14 | \n",
" 2 | \n",
" 19 | \n",
" 110 | \n",
" 11035 | \n",
" 7.987388 | \n",
" R-HSA-156827 | \n",
" L13a-mediated translational silencing of Cerul... | \n",
" 4.710432e-03 | \n",
" 7.987388 | \n",
" 4.710432e-03 | \n",
"
\n",
" \n",
" | 15 | \n",
" 2 | \n",
" 19 | \n",
" 111 | \n",
" 11034 | \n",
" 7.894339 | \n",
" R-HSA-72706 | \n",
" GTP hydrolysis and joining of the 60S ribosoma... | \n",
" 4.958976e-03 | \n",
" 7.894339 | \n",
" 4.958976e-03 | \n",
"
\n",
" \n",
" | 16 | \n",
" 2 | \n",
" 19 | \n",
" 115 | \n",
" 11030 | \n",
" 7.538335 | \n",
" R-HSA-975957 | \n",
" Nonsense Mediated Decay (NMD) enhanced by the ... | \n",
" 6.039983e-03 | \n",
" 7.538335 | \n",
" 6.039983e-03 | \n",
"
\n",
" \n",
" | 17 | \n",
" 1 | \n",
" 20 | \n",
" 15 | \n",
" 11130 | \n",
" 7.362504 | \n",
" R-HSA-204174 | \n",
" Regulation of pyruvate dehydrogenase (PDH) com... | \n",
" 6.659799e-03 | \n",
" 7.362504 | \n",
" 6.659799e-03 | \n",
"
\n",
" \n",
" | 18 | \n",
" 2 | \n",
" 19 | \n",
" 119 | \n",
" 11026 | \n",
" 7.206312 | \n",
" R-HSA-114608 | \n",
" Platelet degranulation | \n",
" 7.264762e-03 | \n",
" 7.206312 | \n",
" 7.264762e-03 | \n",
"
\n",
" \n",
" | 19 | \n",
" 1 | \n",
" 20 | \n",
" 20 | \n",
" 11125 | \n",
" 5.389681 | \n",
" R-HSA-5651801 | \n",
" PCNA-Dependent Long Patch Base Excision Repair | \n",
" 2.025618e-02 | \n",
" 5.389681 | \n",
" 2.025618e-02 | \n",
"
\n",
" \n",
" | 20 | \n",
" 1 | \n",
" 20 | \n",
" 23 | \n",
" 11122 | \n",
" 4.602357 | \n",
" R-HSA-203927 | \n",
" MicroRNA (miRNA) biogenesis | \n",
" 3.192804e-02 | \n",
" 4.602357 | \n",
" 3.192804e-02 | \n",
"
\n",
" \n",
" | 21 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-9660826 | \n",
" Purinergic signaling in leishmaniasis infection | \n",
" 3.631620e-02 | \n",
" 4.382195 | \n",
" 3.631620e-02 | \n",
"
\n",
" \n",
" | 22 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-210991 | \n",
" Basigin interactions | \n",
" 3.631620e-02 | \n",
" 4.382195 | \n",
" 3.631620e-02 | \n",
"
\n",
" \n",
" | 23 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-5223345 | \n",
" Miscellaneous transport and binding events | \n",
" 3.631620e-02 | \n",
" 4.382195 | \n",
" 3.631620e-02 | \n",
"
\n",
" \n",
" | 24 | \n",
" 1 | \n",
" 20 | \n",
" 24 | \n",
" 11121 | \n",
" 4.382195 | \n",
" R-HSA-5696397 | \n",
" Gap-filling DNA repair synthesis and ligation ... | \n",
" 3.631620e-02 | \n",
" 4.382195 | \n",
" 3.631620e-02 | \n",
"
\n",
" \n",
" | 25 | \n",
" 2 | \n",
" 19 | \n",
" 181 | \n",
" 10964 | \n",
" 3.953753 | \n",
" R-HSA-6791226 | \n",
" Major pathway of rRNA processing in the nucleo... | \n",
" 4.676696e-02 | \n",
" 3.953753 | \n",
" 4.676696e-02 | \n",
"
\n",
" \n",
" | 26 | \n",
" 1 | \n",
" 20 | \n",
" 29 | \n",
" 11116 | \n",
" 3.503357 | \n",
" R-HSA-110314 | \n",
" Recognition of DNA damage by PCNA-containing r... | \n",
" 6.124455e-02 | \n",
" 3.503357 | \n",
" 6.124455e-02 | \n",
"
\n",
" \n",
" | 27 | \n",
" 1 | \n",
" 20 | \n",
" 31 | \n",
" 11114 | \n",
" 3.229512 | \n",
" R-HSA-5656169 | \n",
" Termination of translesion DNA synthesis | \n",
" 7.232223e-02 | \n",
" 3.229512 | \n",
" 7.232223e-02 | \n",
"
\n",
" \n",
" | 28 | \n",
" 1 | \n",
" 20 | \n",
" 33 | \n",
" 11112 | \n",
" 2.988310 | \n",
" R-HSA-352230 | \n",
" Amino acid transport across the plasma membrane | \n",
" 8.386764e-02 | \n",
" 2.988310 | \n",
" 8.386764e-02 | \n",
"
\n",
" \n",
" | 29 | \n",
" 1 | \n",
" 20 | \n",
" 36 | \n",
" 11109 | \n",
" 2.676134 | \n",
" R-HSA-176187 | \n",
" Activation of ATR in response to replication s... | \n",
" 1.018627e-01 | \n",
" 2.676134 | \n",
" 1.018627e-01 | \n",
"
\n",
" \n",
" | 30 | \n",
" 1 | \n",
" 20 | \n",
" 36 | \n",
" 11109 | \n",
" 2.676134 | \n",
" R-HSA-8950505 | \n",
" Gene and protein expression by JAK-STAT signal... | \n",
" 1.018627e-01 | \n",
" 2.676134 | \n",
" 1.018627e-01 | \n",
"
\n",
" \n",
" | 31 | \n",
" 1 | \n",
" 20 | \n",
" 37 | \n",
" 11108 | \n",
" 2.583220 | \n",
" R-HSA-5083635 | \n",
" Defective B3GALTL causes Peters-plus syndrome ... | \n",
" 1.080017e-01 | \n",
" 2.583220 | \n",
" 1.080017e-01 | \n",
"
\n",
" \n",
" | 32 | \n",
" 1 | \n",
" 20 | \n",
" 38 | \n",
" 11107 | \n",
" 2.495164 | \n",
" R-HSA-5173214 | \n",
" O-glycosylation of TSR domain-containing proteins | \n",
" 1.141965e-01 | \n",
" 2.495164 | \n",
" 1.141965e-01 | \n",
"
\n",
" \n",
" | 33 | \n",
" 1 | \n",
" 20 | \n",
" 40 | \n",
" 11105 | \n",
" 2.332203 | \n",
" R-HSA-5696400 | \n",
" Dual Incision in GG-NER | \n",
" 1.267224e-01 | \n",
" 2.332203 | \n",
" 1.267224e-01 | \n",
"
\n",
" \n",
" | 34 | \n",
" 1 | \n",
" 20 | \n",
" 54 | \n",
" 11091 | \n",
" 1.530727 | \n",
" R-HSA-9013409 | \n",
" RHOJ GTPase cycle | \n",
" 2.160034e-01 | \n",
" 1.530727 | \n",
" 2.160034e-01 | \n",
"
\n",
" \n",
" | 35 | \n",
" 1 | \n",
" 20 | \n",
" 54 | \n",
" 11091 | \n",
" 1.530727 | \n",
" R-HSA-193648 | \n",
" NRAGE signals death through JNK | \n",
" 2.160034e-01 | \n",
" 1.530727 | \n",
" 2.160034e-01 | \n",
"
\n",
" \n",
" | 36 | \n",
" 1 | \n",
" 20 | \n",
" 63 | \n",
" 11082 | \n",
" 1.206546 | \n",
" R-HSA-6782210 | \n",
" Gap-filling DNA repair synthesis and ligation ... | \n",
" 2.720173e-01 | \n",
" 1.206546 | \n",
" 2.720173e-01 | \n",
"
\n",
" \n",
" | 37 | \n",
" 1 | \n",
" 20 | \n",
" 64 | \n",
" 11081 | \n",
" 1.176348 | \n",
" R-HSA-6782135 | \n",
" Dual incision in TC-NER | \n",
" 2.781007e-01 | \n",
" 1.176348 | \n",
" 2.781007e-01 | \n",
"
\n",
" \n",
" | 38 | \n",
" 1 | \n",
" 20 | \n",
" 65 | \n",
" 11080 | \n",
" 1.147121 | \n",
" R-HSA-8936459 | \n",
" RUNX1 regulates genes involved in megakaryocyt... | \n",
" 2.841526e-01 | \n",
" 1.147121 | \n",
" 2.841526e-01 | \n",
"
\n",
" \n",
" | 39 | \n",
" 1 | \n",
" 20 | \n",
" 70 | \n",
" 11075 | \n",
" 1.014107 | \n",
" R-HSA-68949 | \n",
" Orc1 removal from chromatin | \n",
" 3.139209e-01 | \n",
" 1.014107 | \n",
" 3.139209e-01 | \n",
"
\n",
" \n",
" | 40 | \n",
" 1 | \n",
" 20 | \n",
" 73 | \n",
" 11072 | \n",
" 0.943520 | \n",
" R-HSA-9013408 | \n",
" RHOG GTPase cycle | \n",
" 3.313742e-01 | \n",
" 0.943520 | \n",
" 3.313742e-01 | \n",
"
\n",
" \n",
" | 41 | \n",
" 1 | \n",
" 20 | \n",
" 79 | \n",
" 11066 | \n",
" 0.819465 | \n",
" R-HSA-416482 | \n",
" G alpha (12/13) signalling events | \n",
" 3.653366e-01 | \n",
" 0.819465 | \n",
" 3.653366e-01 | \n",
"
\n",
" \n",
" | 42 | \n",
" 1 | \n",
" 20 | \n",
" 80 | \n",
" 11065 | \n",
" 0.800735 | \n",
" R-HSA-216083 | \n",
" Integrin cell surface interactions | \n",
" 3.708738e-01 | \n",
" 0.800735 | \n",
" 3.708738e-01 | \n",
"
\n",
" \n",
" | 43 | \n",
" 1 | \n",
" 20 | \n",
" 87 | \n",
" 11058 | \n",
" 0.682709 | \n",
" R-HSA-9013404 | \n",
" RAC2 GTPase cycle | \n",
" 4.086556e-01 | \n",
" 0.682709 | \n",
" 4.086556e-01 | \n",
"
\n",
" \n",
" | 44 | \n",
" 1 | \n",
" 20 | \n",
" 91 | \n",
" 11054 | \n",
" 0.624208 | \n",
" R-HSA-6804756 | \n",
" Regulation of TP53 Activity through Phosphoryl... | \n",
" 4.294877e-01 | \n",
" 0.624208 | \n",
" 4.294877e-01 | \n",
"
\n",
" \n",
" | 45 | \n",
" 1 | \n",
" 20 | \n",
" 93 | \n",
" 11052 | \n",
" 0.597060 | \n",
" R-HSA-9013423 | \n",
" RAC3 GTPase cycle | \n",
" 4.397019e-01 | \n",
" 0.597060 | \n",
" 4.397019e-01 | \n",
"
\n",
" \n",
" | 46 | \n",
" 1 | \n",
" 20 | \n",
" 99 | \n",
" 11046 | \n",
" 0.523015 | \n",
" R-HSA-6811434 | \n",
" COPI-dependent Golgi-to-ER retrograde traffic | \n",
" 4.695582e-01 | \n",
" 0.523015 | \n",
" 4.695582e-01 | \n",
"
\n",
" \n",
" | 47 | \n",
" 1 | \n",
" 20 | \n",
" 101 | \n",
" 11044 | \n",
" 0.500559 | \n",
" R-HSA-6807878 | \n",
" COPI-mediated anterograde transport | \n",
" 4.792546e-01 | \n",
" 0.500559 | \n",
" 4.792546e-01 | \n",
"
\n",
" \n",
" | 48 | \n",
" 1 | \n",
" 20 | \n",
" 149 | \n",
" 10996 | \n",
" 0.170910 | \n",
" R-HSA-8980692 | \n",
" RHOA GTPase cycle | \n",
" 6.793044e-01 | \n",
" 0.170910 | \n",
" 6.793044e-01 | \n",
"
\n",
" \n",
" | 49 | \n",
" 1 | \n",
" 20 | \n",
" 158 | \n",
" 10987 | \n",
" 0.137271 | \n",
" R-HSA-9013148 | \n",
" CDC42 GTPase cycle | \n",
" 7.110095e-01 | \n",
" 0.137271 | \n",
" 7.110095e-01 | \n",
"
\n",
" \n",
" | 50 | \n",
" 1 | \n",
" 20 | \n",
" 184 | \n",
" 10961 | \n",
" 0.067711 | \n",
" R-HSA-9013149 | \n",
" RAC1 GTPase cycle | \n",
" 7.946990e-01 | \n",
" 0.067711 | \n",
" 7.946990e-01 | \n",
"
\n",
" \n",
" | 51 | \n",
" 1 | \n",
" 20 | \n",
" 204 | \n",
" 10941 | \n",
" 0.034678 | \n",
" R-HSA-5689880 | \n",
" Ub-specific processing proteases | \n",
" 8.522717e-01 | \n",
" 0.034678 | \n",
" 8.522717e-01 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" in_gene_in_pathway in_gene_not_pathway ... scipy_stat scipy_p_value\n",
"0 2 19 ... 33.477023 7.211088e-09\n",
"1 1 20 ... 32.311989 1.313007e-08\n",
"2 1 20 ... 32.311989 1.313007e-08\n",
"3 2 19 ... 20.236790 6.842431e-06\n",
"4 1 20 ... 18.048197 2.153825e-05\n",
"5 2 19 ... 17.513505 2.852741e-05\n",
"6 2 19 ... 17.513505 2.852741e-05\n",
"7 1 20 ... 15.671798 7.533920e-05\n",
"8 1 20 ... 15.671798 7.533920e-05\n",
"9 1 20 ... 11.137000 8.462266e-04\n",
"10 2 19 ... 10.567006 1.151240e-03\n",
"11 2 19 ... 10.006894 1.559553e-03\n",
"12 2 19 ... 9.744550 1.798552e-03\n",
"13 2 19 ... 9.020030 2.670370e-03\n",
"14 2 19 ... 7.987388 4.710432e-03\n",
"15 2 19 ... 7.894339 4.958976e-03\n",
"16 2 19 ... 7.538335 6.039983e-03\n",
"17 1 20 ... 7.362504 6.659799e-03\n",
"18 2 19 ... 7.206312 7.264762e-03\n",
"19 1 20 ... 5.389681 2.025618e-02\n",
"20 1 20 ... 4.602357 3.192804e-02\n",
"21 1 20 ... 4.382195 3.631620e-02\n",
"22 1 20 ... 4.382195 3.631620e-02\n",
"23 1 20 ... 4.382195 3.631620e-02\n",
"24 1 20 ... 4.382195 3.631620e-02\n",
"25 2 19 ... 3.953753 4.676696e-02\n",
"26 1 20 ... 3.503357 6.124455e-02\n",
"27 1 20 ... 3.229512 7.232223e-02\n",
"28 1 20 ... 2.988310 8.386764e-02\n",
"29 1 20 ... 2.676134 1.018627e-01\n",
"30 1 20 ... 2.676134 1.018627e-01\n",
"31 1 20 ... 2.583220 1.080017e-01\n",
"32 1 20 ... 2.495164 1.141965e-01\n",
"33 1 20 ... 2.332203 1.267224e-01\n",
"34 1 20 ... 1.530727 2.160034e-01\n",
"35 1 20 ... 1.530727 2.160034e-01\n",
"36 1 20 ... 1.206546 2.720173e-01\n",
"37 1 20 ... 1.176348 2.781007e-01\n",
"38 1 20 ... 1.147121 2.841526e-01\n",
"39 1 20 ... 1.014107 3.139209e-01\n",
"40 1 20 ... 0.943520 3.313742e-01\n",
"41 1 20 ... 0.819465 3.653366e-01\n",
"42 1 20 ... 0.800735 3.708738e-01\n",
"43 1 20 ... 0.682709 4.086556e-01\n",
"44 1 20 ... 0.624208 4.294877e-01\n",
"45 1 20 ... 0.597060 4.397019e-01\n",
"46 1 20 ... 0.523015 4.695582e-01\n",
"47 1 20 ... 0.500559 4.792546e-01\n",
"48 1 20 ... 0.170910 6.793044e-01\n",
"49 1 20 ... 0.137271 7.110095e-01\n",
"50 1 20 ... 0.067711 7.946990e-01\n",
"51 1 20 ... 0.034678 8.522717e-01\n",
"\n",
"[52 rows x 10 columns]"
]
},
"metadata": {},
"execution_count": 36
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HBGSd1wS5GRK"
},
"source": [
"# Conclusion\n",
"\n",
"This notebook demonstrated usage of the Reactome BigQuery dataset for basic cancer pathway identification from a gene set, as well as a more complex pathway enrichment analysis using a chi-squared statistic."
]
}
]
}