{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "7U3xdSiq1N4f" }, "source": [ "# Regulome Explorer Notebook 1\n", "\n", "Gene-gene correlations using PanCancer expression and copy number data.\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "eTxa1hqJiDDy" }, "source": [ "## Authentication and Setup" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Q9W2cFJqiGL3" }, "source": [ "### Authenticate with Google (IMPORTANT)\n", "Our first step is to authenticate with Google -- you will need to be a member of a Google Cloud Platform (GCP) project, with authorization to run BigQuery jobs in order to run this notebook. If you don't have access to a GCP project, please contact the ISB-CGC team for help (www.isb-cgc.org)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Your browser has been opened to visit:\n", "\n", " https://accounts.google.com/o/oauth2/auth?redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&prompt=select_account&response_type=code&client_id=32555940559.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&access_type=offline\n", "\n", "\n", "[11806:11825:0508/110154.885591:ERROR:browser_process_sub_thread.cc(217)] Waited 6 ms for network service\n", "Opening in existing browser session.\n", "\u001b[1;33mWARNING:\u001b[0m `gcloud auth login` no longer writes application default credentials.\n", "If you need to use ADC, see:\n", " gcloud auth application-default --help\n", "\n", "You are now logged in as [dgibbs@systemsbiology.org].\n", "Your current project is [isb-cgc-02-0001]. You can change this setting by running:\n", " $ gcloud config set project PROJECT_ID\n" ] } ], "source": [ "# Running notebook locally...\n", "!gcloud auth login\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "JR6NGONAiGL5", "outputId": "52affb77-9d0d-4049-ab14-3b6a79cf67ea" }, "outputs": [], "source": [ "## IF YOU'RE IN GOOGLE COLAB ##\n", "\n", "from google.colab import auth\n", "try:\n", " auth.authenticate_user()\n", " print('You have been successfully authenticated!')\n", "except:\n", " print('You have not been authenticated.')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "fWCknLBUiGMA" }, "source": [ "### Initialize connection to BigQuery (IMPORTANT)\n", "Once you're authenticated, we'll begin getting set up to pull data out of BigQuery. \n", "\n", "The first step is to initialize the BigQuery client. This requires specifying a Google Cloud Platform (GCP) **project id** in which you have the necessary privileges (also referred to as \"roles\") to execute queries and access the data used by this notebook.\n", "\n", "Below, we use a GCP project called ``isb-cgc-bq``. If you have your own GCP project that you would like to use instead, please **edit** the line below that sets the ``project_id`` variable *before* running the next cell." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "fvLoeBq7iGMB", "outputId": "1409751c-b783-4b21-c7b6-af3ea5e48b87" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/davidgibbs/.local/lib/python3.6/site-packages/google/auth/_default.py:66: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a \"quota exceeded\" or \"API not enabled\" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/\n", " warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "BigQuery client successfully initialized\n" ] } ], "source": [ "from google.cloud import bigquery\n", "\n", "try:\n", " project_id = 'isb-cgc-02-0001'\n", " bqclient = bigquery.Client(project=project_id)\n", " print('BigQuery client successfully initialized')\n", "except:\n", " print('Failure to initialize BigQuery client')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "v-3gnS2hsQau" }, "source": [ "### Software Configuration\n", "\n", "Unless any part of this section fails to run, you only need to look at this in detail if you are interested in understanding how this notebook works or in modifying it for your own purposes. \n", "\n", "Note that running this section alone will take about **5 minutes** due to the sizes of some of the libraries being imported, as well as the number of dependent libraries. \n", "\n", "Of course, once this section has been run, you will be able to run, modify, and re-run any of the later sections of this notebook without having to come back and rerun this section (as long as you have not lost your connection to the Jupyter server)." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "i-g6ePPSZ7eT" }, "source": [ "#### Import Python libraries\n", "\n", "Import NumPy, Pandas, and seaborn" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": {}, "colab_type": "code", "id": "GV1ir0jGZzch" }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ig6oNOMoWHRc" }, "source": [ "#### Python convenience functions\n", "\n", "We define two convenience functions here:\n", "\n", "- **`runQuery`**: a relatively generic BigQuery query-execution wrapper function which can be used to run a query in \"dry-run\" mode or not: the call to the `query()` function itself is inside a `try/except` block and if it fails we return `None`; otherwise a \"dry\" will return an empty dataframe, and a \"live\" run will return the query results as a dataframe\n", "\n", "- **`checkQueryResults`**: a generic function that makes sure that what was returned is a dataframe, and checks how many rows are in the returned dataframe" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": {}, "colab_type": "code", "id": "rbY7BD2xWGET" }, "outputs": [], "source": [ "def runQuery ( client, qString, dryRun=False ):\n", " \n", " print ( \"\\n in runQuery ... \" )\n", " if ( dryRun ):\n", " print ( \" dry-run only \" )\n", " \n", " ## set up QueryJobConfig object\n", " job_config = bigquery.QueryJobConfig()\n", " job_config.dry_run = dryRun\n", " job_config.use_query_cache = True\n", " job_config.use_legacy_sql = False\n", " \n", " ## run the query\n", " try:\n", " query_job = client.query ( qString, job_config=job_config )\n", " ## print ( \" query job state: \", query_job.state )\n", " except:\n", " print ( \" FATAL ERROR: query execution failed \" )\n", " return ( None )\n", " \n", " ## return results as a dataframe (or an empty dataframe for a dry-run) \n", " if ( not dryRun ):\n", " try:\n", " df = query_job.to_dataframe()\n", " if ( query_job.total_bytes_processed==0 ):\n", " print ( \" the results for this query were previously cached \" )\n", " else:\n", " print ( \" this query processed {} bytes \".format(query_job.total_bytes_processed) )\n", " if ( len(df) < 1 ):\n", " print ( \" WARNING: this query returned NO results \")\n", " return ( df )\n", " except:\n", " print ( \" FATAL ERROR: query execution failed \" )\n", " return ( None )\n", " \n", " else:\n", " print ( \" if not cached, this query will process {} bytes \".format(query_job.total_bytes_processed) )\n", " ## return an empty dataframe\n", " return ( pd.DataFrame() )\n", " \n", "\n", "\n", "def checkQueryResults ( qr ):\n", " print ( \"\\n in checkQueryResults ... \" )\n", " if ( not isinstance(qr, pd.DataFrame) ):\n", " print ( \" query execution failed! \" )\n", " return ( False )\n", " else:\n", " if ( len(qr) > 0 ): \n", " print ( \" # of rows in query results: {} \".format(len(qr)) )\n", " print ( \"\\n\", qr.head() )\n", " else:\n", " print ( \" query returned NO results ?!? \" ) \n", " return ( True )" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "dOiLP6godt9-" }, "source": [ "### SQL Building Functions\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": {}, "colab_type": "code", "id": "flysN5k4duVC" }, "outputs": [], "source": [ "def build_cohort ( study ):\n", " qString = \"\"\"\n", " WITH\n", " --\n", " -- we start with the clinical table\n", " --\n", " cohort AS (\n", " SELECT\n", " acronym as Study,\n", " bcr_patient_barcode as ParticipantBarcode\n", " FROM\n", " `isb-cgc-01-0008.Filtered.clinical_PANCAN_patient_with_followup_filtered` \n", " WHERE\n", " acronym = '__study__'\n", " )\n", " \"\"\".replace('__study__',study)\n", " return(qString)\n", " " ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": {}, "colab_type": "code", "id": "L2qVcsmyd6NA" }, "outputs": [], "source": [ "def select_genes ( dtype, n ):\n", " if dtype == 'expr':\n", " qString = \"\"\"\n", " selected_genes AS (\n", " SELECT\n", " Symbol,\n", " STDDEV(normalized_count) AS sigmaExp\n", " FROM\n", " `isb-cgc-01-0008.Filtered.EBpp_AdjustPANCAN_RNASeqV2_filtered`\n", " WHERE\n", " Symbol IS NOT NULL\n", " AND ParticipantBarcode IN (\n", " SELECT\n", " ParticipantBarcode\n", " FROM\n", " cohort)\n", " GROUP BY\n", " 1\n", " ORDER BY\n", " sigmaExp DESC\n", " LIMIT\n", " __n__ )\n", " \"\"\".replace('__n__', str(n))\n", " \n", " elif dtype == 'cnv':\n", " qString = \"\"\"\n", " selected_genes AS (\n", " SELECT\n", " Gene_Symbols as Symbol,\n", " STDDEV(GISTIC_Calls) AS sigmaExp\n", " FROM\n", " `isb-cgc-01-0008.Filtered.all_CNVR_data_by_gene_filtered`\n", " WHERE\n", " Symbol IS NOT NULL\n", " AND ParticipantBarcode IN (\n", " SELECT\n", " ParticipantBarcode\n", " FROM\n", " cohort)\n", " GROUP BY\n", " 1\n", " ORDER BY\n", " sigmaExp DESC\n", " LIMIT\n", " __n__ )\n", " \"\"\".replace('__n__', str(n))\n", " else:\n", " # \n", " # Could be some other gene selection function here ##\n", " #\n", " qString = ''\n", " return(qString)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": {}, "colab_type": "code", "id": "Kvc3PWFxliOH" }, "outputs": [], "source": [ "def get_expr_data(i):\n", " qString = \"\"\"\n", " expr_data__i__ AS (\n", " SELECT\n", " Symbol,\n", " ParticipantBarcode,\n", " SampleBarcode,\n", " normalized_count as expr,\n", " DENSE_RANK() OVER (PARTITION BY SampleBarcode ORDER BY normalized_count ASC) AS rankExpr\n", " FROM\n", " `isb-cgc-01-0008.Filtered.EBpp_AdjustPANCAN_RNASeqV2_filtered`\n", " WHERE\n", " Symbol IS NOT NULL AND\n", " Symbol IN (\n", " SELECT\n", " Symbol\n", " FROM\n", " selected_genes)\n", " AND ParticipantBarcode IN (\n", " SELECT\n", " ParticipantBarcode\n", " FROM\n", " cohort) \n", " )\n", " \"\"\".replace('__i__', '_'+str(i))\n", " return(qString)\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": {}, "colab_type": "code", "id": "J_VxV7_sfmZI" }, "outputs": [], "source": [ " def join_data(i,j):\n", " qString = \"\"\"\n", " j_data AS (\n", " SELECT\n", " expr_data_1.Symbol as expr_data_1_Symbol,\n", " expr_data_1.SampleBarcode,\n", " expr_data_1.rankExpr as expr_data_1_rankExpr,\n", " expr_data_1.expr,\n", " expr_data_2.Symbol as expr_data_2_Symbol,\n", " expr_data_2.SampleBarcode,\n", " expr_data_2.rankExpr as expr_data_2_rankExpr,\n", " expr_data_2.expr\n", " FROM\n", " expr_data__i__ JOIN expr_data__j__ \n", " ON\n", " expr_data__i__.SampleBarcode = expr_data__j__.SampleBarcode\n", " AND expr_data__i__.Symbol > expr_data__j__.Symbol \n", " )\n", " \"\"\".replace('__i__', '_'+str(i)).replace('__j__', '_'+str(j))\n", " return(qString)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "colab": {}, "colab_type": "code", "id": "fXQlABc1u-EU" }, "outputs": [], "source": [ "def comp_corr(i,j):\n", " qString = \"\"\"\n", " corr_table AS (\n", " SELECT\n", " expr_data__i___Symbol,\n", " expr_data__j___Symbol,\n", " CORR(expr_data__i___rankExpr,expr_data__j___rankExpr) AS corr\n", " FROM\n", " j_data\n", " GROUP BY\n", " 1,2\n", " ORDER BY\n", " corr DESC\n", " )\n", " \"\"\".replace('__i__', '_'+str(i)).replace('__j__', '_'+str(j))\n", " return(qString)\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": {}, "colab_type": "code", "id": "uwBhaorUy0zO" }, "outputs": [], "source": [ "def final( last_table ):\n", " qString = \"\"\"\n", " SELECT * FROM __last_table__\n", " \"\"\".replace('__last_table__', last_table)\n", " return(qString)\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": {}, "colab_type": "code", "id": "c13LfwSbzx_I" }, "outputs": [], "source": [ "def build_query( study, dtype, size ): \n", " i = '1'\n", " j = '2'\n", " sql = (\n", " build_cohort( study ) + ',\\n' +\n", " select_genes ( dtype, size ) + ',' +\n", " get_expr_data(i) + ',\\n' +\n", " get_expr_data(j) + ',\\n' + #### change here\n", " join_data(i,j) + ',\\n' +\n", " comp_corr(i,j) + '\\n' +\n", " final( 'corr_table' )\n", " )\n", " return(sql)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ZWIIjxKuiGMG" }, "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "By0ap5rf1NKq" }, "source": [ "## Select variables to analyze\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "g_d7LV6jivX5" }, "source": [ "Choose a TCGA study, number of genes, and the data source to use in ranking genes." ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": {}, "colab_type": "code", "id": "vLYGUSVRjYP4" }, "outputs": [], "source": [ "# select a tumor type\n", "\n", "studyList = [ 'ACC', 'BLCA', 'BRCA', 'CESC', 'CHOL', 'COAD', 'DLBC', 'ESCA', \n", " 'GBM', 'HNSC', 'KICH', 'KIRC', 'KIRP', 'LAML', 'LGG', 'LIHC', \n", " 'LUAD', 'LUSC', 'MESO', 'OV', 'PAAD', 'PCPG', 'PRAD', 'READ', \n", " 'SARC', 'SKCM', 'STAD', 'TGCT', 'THCA', 'THYM', 'UCEC', 'UCS', 'UVM' ]\n", "\n", "study = studyList[6]\n", "\n", "# choosing which genes to look at\n", "n = 100 # the n most variable genes\n", "\n", "# building the query string\n", "sql = build_query (study, 'expr', n)\n", "\n", "# print(sql)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "E7jT5HX9jHOq", "outputId": "a121ab60-9a88-49d5-f0b0-598e82d766cd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " in runQuery ... \n", " this query processed 10416423703 bytes \n" ] } ], "source": [ "# calling Google! #\n", "\n", "res0 = runQuery ( bqclient, sql, dryRun=False )\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "FyTzLgdN2CaT" }, "source": [ "## Results\n" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "nWM3AVOhGEN-", "outputId": "6331c5b6-165a-40ba-a66e-49a78c9cd38d" }, "outputs": [ { "data": { "text/plain": [ "(4950, 3)" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res0.shape" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "res0['index'] = range(0,4950)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "colab": {}, "colab_type": "code", "id": "qW85FrbELL0V" }, "outputs": [ { "data": { "text/html": [ "
\n", " | expr_data_1_Symbol | \n", "expr_data_2_Symbol | \n", "corr | \n", "index | \n", "
---|---|---|---|---|
0 | \n", "PGA4 | \n", "LIPF | \n", "0.999086 | \n", "0 | \n", "
1 | \n", "PGC | \n", "PGA4 | \n", "0.998607 | \n", "1 | \n", "
2 | \n", "PGC | \n", "LIPF | \n", "0.998592 | \n", "2 | \n", "
3 | \n", "PGA3 | \n", "LIPF | \n", "0.998415 | \n", "3 | \n", "
4 | \n", "PGA4 | \n", "PGA3 | \n", "0.998272 | \n", "4 | \n", "
5 | \n", "PGC | \n", "PGA3 | \n", "0.997350 | \n", "5 | \n", "
6 | \n", "C1QB | \n", "C1QA | \n", "0.958305 | \n", "6 | \n", "
7 | \n", "COL3A1 | \n", "COL1A1 | \n", "0.942048 | \n", "7 | \n", "
8 | \n", "COL1A2 | \n", "COL1A1 | \n", "0.933614 | \n", "8 | \n", "
9 | \n", "COL3A1 | \n", "COL1A2 | \n", "0.900852 | \n", "9 | \n", "