{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regulome Explorer Notebook \n", "\n", "This notebook computes significant association scores between pairs of data types available in the PanCancer Atlas dataset of ISB-CGC. The specific statistical tests implemented are described ['here'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/RegulomeExplorerNotebooks.html#standard-pairwise-statistics), and a description of the original Regulome Explorer is available ['here'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/RegulomeExplorerNotebooks.html#id5).\n", "\n", "The output of the notebook is a table of significant associations, each characterized by a correlation and a p-value. The notebook also performs a more detailed analysis of a user-specified pair of feature names, generating figures and additional statistics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Authentication\n", "The first step is to authorize access to BigQuery and Google Cloud. For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html); alternative authentication methods are described [here](https://googleapis.github.io/google-cloud-python/latest/core/auth.html)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Python libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "from google.cloud import bigquery\n", "import pandas as pd\n", "import re_module.bq_functions as regulome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specify Parameters\n", "The parameters for this experiment are the cancer type (study), a list of genes, a pair of molecular features (Feature1 and Feature2), the significance level, and the minimum number of samples required for the statistical analysis. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d67f15cabead4ce2b0c07e971942e4df", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(HTML(value='Select Feature1 '), Dropdown(options=('Gene Expression', 'Somatic Copy Num…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "36c0b62829b7434ca629eeedd0662244", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(HTML(value='Feature1 labels '), Text(value='IGF2, ADAM6', placeholder='Type gene names…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8056c84c1e7040cfae961cd70455d65a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(HTML(value='Select Feature2 '), Dropdown(options=('Gene Expression', 'Somatic Mutation…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "34f11c505ac34b95936a6e862a60d2af", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(HTML(value='Select a study '), Dropdown(index=30, options=('ACC', 
'BLCA', 'BRCA', 'CES…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "454af58f8a9c4f35aa3c8de863605543", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(HTML(value='Significance level '), SelectionSlider(continuous_update=False, index=1, o…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3c536964291c45b487f4587bf6995294", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(HTML(value='Minimum number of samples'), IntSlider(value=25, max=50, min=5)))" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d2a6f40c1c6d4b01bdbec375c743bbe6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(HTML(value='Cohort list'), FileUpload(value={}, description='Upload')))" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "[study, feature1, feature2, gene_names, size, cohortlist, significance] = regulome.makeWidgets()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build the query\n", "The BigQuery query that computes associations between features 1 and 2 is built using functions in the 'regulome' module. Please refer to our GitHub repository for notebooks that describe the methods used for each possible combination of features available in TCGA: https://github.com/isb-cgc/Community-Notebooks/tree/master/RegulomeExplorer." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CREATE TEMP FUNCTION erfcc(x FLOAT64)\n", "RETURNS FLOAT64\n", "LANGUAGE js AS \"\"\"\n", " \n", " var t; \n", " var z; \n", " var ans;\n", " z = Math.abs(x) ;\n", " t = 1.0 / (1.0 + 0.5*z ) ;\n", " \n", " ans= t * Math.exp(-z*z-1.26551223+t*(1.00002368+t*(0.37409196+t*(0.09678418+\n", "t*(-0.18628806+t*(0.27886807+t*(-1.13520398+t*(1.48851587+\n", "t*(-0.82215223+t*0.17087277)))))))));\n", " \n", " if ( x >= 0 ) {\n", " return ans ;\n", " } else {\n", " return 2.0 - ans;\n", " }\n", "\"\"\";\n", "\n", "WITH\n", "table1 AS (\n", "SELECT\n", " symbol,\n", " (RANK() OVER (PARTITION BY symbol ORDER BY data ASC)) + (COUNT(*) OVER ( PARTITION BY symbol, CAST(data as STRING)) - 1)/2.0 AS rnkdata,\n", " ParticipantBarcode\n", "FROM (\n", " SELECT\n", " Symbol AS symbol, \n", " AVG( LOG10( normalized_count + 1 ) ) AS data,\n", " ParticipantBarcode AS ParticipantBarcode\n", " FROM `pancancer-atlas.Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered`\n", " WHERE Study = 'PAAD' # cohort \n", " AND Symbol IN UNNEST(@GENELIST) # labels \n", " AND normalized_count IS NOT NULL \n", " GROUP BY\n", " ParticipantBarcode, symbol\n", " )\n", ")\n", ",\n", "table2 AS (\n", "SELECT\n", " symbol,\n", " (RANK() OVER (PARTITION BY symbol ORDER BY data ASC)) + (COUNT(*) OVER ( PARTITION BY symbol, CAST(data as STRING)) - 1)/2.0 AS rnkdata,\n", " ParticipantBarcode\n", "FROM (\n", " SELECT\n", " Symbol AS symbol, \n", " AVG( LOG10( normalized_count + 1 ) ) AS data,\n", " ParticipantBarcode AS ParticipantBarcode\n", " FROM `pancancer-atlas.Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered`\n", " WHERE Study = 'PAAD' # cohort \n", " AND Symbol IS NOT NULL # labels \n", " AND normalized_count IS NOT NULL \n", " GROUP BY\n", " ParticipantBarcode, symbol\n", " )\n", ")\n", ",\n", "summ_table AS (\n", "SELECT \n", " 
n1.symbol as symbol1,\n", " n2.symbol as symbol2,\n", " COUNT( n1.ParticipantBarcode ) as n,\n", " CORR(n1.rnkdata , n2.rnkdata) as correlation\n", " \n", "FROM\n", " table1 AS n1\n", "INNER JOIN\n", " table2 AS n2\n", "ON\n", " n1.ParticipantBarcode = n2.ParticipantBarcode\n", " AND n2.symbol NOT IN UNNEST(@GENELIST)\n", "GROUP BY\n", " symbol1, symbol2\n", "UNION ALL\n", "SELECT \n", " n1.symbol as symbol1,\n", " n2.symbol as symbol2,\n", " COUNT( n1.ParticipantBarcode ) as n,\n", " CORR(n1.rnkdata , n2.rnkdata) as correlation\n", " \n", "FROM\n", " table1 AS n1\n", "INNER JOIN\n", " table1 AS n2\n", "ON\n", " n1.ParticipantBarcode = n2.ParticipantBarcode\n", " AND n1.symbol < n2.symbol\n", "GROUP BY\n", " symbol1, symbol2\n", ")\n", "SELECT symbol1, symbol2, n, correlation\n", "FROM summ_table\n", "WHERE \n", " n > 25 AND n < 500 AND NOT IS_NAN( correlation)\n", "GROUP BY 1,2,3,4\n", "HAVING `cgc-05-0042.functions.significance_level_ttest2`(n-2, ABS(correlation)*SQRT((n-2)/((1+correlation)*(1-correlation)))) <= 0.05\n", "UNION ALL\n", "SELECT symbol1, symbol2, n, correlation \n", "FROM summ_table\n", "WHERE \n", " n >= 500 AND NOT IS_NAN( correlation)\n", "GROUP BY 1,2,3,4\n", "HAVING erfcc( ABS(correlation)*SQRT(n)/1.414213562373095 ) <= 0.05\n", "ORDER BY ABS(correlation) DESC\n", "\n" ] } ], "source": [ "SampleList, PatientList = regulome.readcohort( cohortlist )\n", "LabelList = [ x.strip() for x in gene_names.value.split(',') ]\n", "\n", "funct1 = regulome.approx_significant_level( )\n", "table1, table2 = regulome.get_feature_tables(study.value,feature1.value,feature2.value,SampleList,PatientList,LabelList)\n", "str_summarized = regulome.get_summarized_pancanatlas( feature1.value, feature2.value )\n", "str_stats = regulome.get_stat_pancanatlas(feature1.value, feature2.value, size.value, significance.value )\n", "\n", "sql = (funct1 + 'WITH' + table1 + ',' + table2 + ',' + str_summarized + str_stats)\n", "print(sql)" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "### Run the BigQuery query" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " in runQuery ... \n", " this query processed 7757877633 bytes \n", " Approx. elpased time : 5605 miliseconds \n" ] }, { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>symbol1</th><th>symbol2</th><th>n</th><th>correlation</th><th>p-value</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>SMAD4</td><td>WDR7</td><td>151</td><td>0.738707</td><td>0.000000</td></tr>\n",
"    <tr><th>1</th><td>SMAD4</td><td>ZNF24</td><td>151</td><td>0.732436</td><td>0.000000</td></tr>\n",
"    <tr><th>2</th><td>SMAD4</td><td>TSHZ1</td><td>151</td><td>0.727285</td><td>0.000000</td></tr>\n",
"    <tr><th>3</th><td>SMAD4</td><td>DTNA</td><td>151</td><td>0.723987</td><td>0.000000</td></tr>\n",
"    <tr><th>4</th><td>SMAD4</td><td>ELAC1</td><td>151</td><td>0.714381</td><td>0.000000</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>11757</th><td>SMAD4</td><td>FAM111A</td><td>151</td><td>-0.159906</td><td>0.049848</td></tr>\n",
"    <tr><th>11758</th><td>SMAD4</td><td>ACER3</td><td>151</td><td>0.159878</td><td>0.049888</td></tr>\n",
"    <tr><th>11759</th><td>SMAD4</td><td>LBR</td><td>151</td><td>0.159840</td><td>0.049944</td></tr>\n",
"    <tr><th>11760</th><td>SMAD4</td><td>TIMD4</td><td>151</td><td>0.159830</td><td>0.049957</td></tr>\n",
"    <tr><th>11761</th><td>SMAD4</td><td>SSX4</td><td>151</td><td>-0.159820</td><td>0.049972</td></tr>\n",
"  </tbody>\n", "</table>\n", "<p>11762 rows × 5 columns</p>\n", "</div>" ], "text/plain": [ " symbol1 symbol2 n correlation p-value\n", "0 SMAD4 WDR7 151 0.738707 0.000000\n", "1 SMAD4 ZNF24 151 0.732436 0.000000\n", "2 SMAD4 TSHZ1 151 0.727285 0.000000\n", "3 SMAD4 DTNA 151 0.723987 0.000000\n", "4 SMAD4 ELAC1 151 0.714381 0.000000\n", "... ... ... ... ... ...\n", "11757 SMAD4 FAM111A 151 -0.159906 0.049848\n", "11758 SMAD4 ACER3 151 0.159878 0.049888\n", "11759 SMAD4 LBR 151 0.159840 0.049944\n", "11760 SMAD4 TIMD4 151 0.159830 0.049957\n", "11761 SMAD4 SSX4 151 -0.159820 0.049972\n", "\n", "[11762 rows x 5 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bqclient = bigquery.Client()\n", "df_results = regulome.runQuery ( bqclient, sql, LabelList, SampleList, PatientList, dryRun=False )\n", "regulome.pvalues_dataframe( df_results )\n", "df_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyze a pair of labels\n", "From the table above, select a pair of feature names to run a statistical analysis and display the data. \n", "**pair_query** is the query used to retrieve the data needed for the statistical test; print this variable to inspect the query. 
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "07e247b24bd04c419e9872a89cc9469b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(HTML(value='Type label 1 '), Text(value='', placeholder='label name')))" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "86310807c04c4b7b836c4a12bf647f74", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(HTML(value='Type label 2 '), Text(value='', placeholder='label name')))" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "[name1 , name2 ] = regulome.makeWidgetsPair()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " in runQuery ... \n", " this query processed 7757877633 bytes \n", " Approx. elpased time : 1264 miliseconds \n", "SpearmanrResult(correlation=0.7387068665040083, pvalue=2.6102587012639067e-27)\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pair_query = regulome.get_query_pair(name1.value,name2.value,study.value,SampleList,feature1.value,feature2.value)\n", "#print(pair_query)\n", "df_pair = regulome.runQuery( bqclient, pair_query, LabelList, SampleList, PatientList, dryRun=False )\n", "regulome.plot_statistics_pair ( df_pair, feature2.value, name1.value, name2.value, size.value )\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }