{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Correlations_Protein_and_Gene_expression-CPTAC.ipynb", "provenance": [], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "9h1bEFTSoGyu" }, "source": [ "# Compute correlations of protein and gene expression in CPTAC\n", "\n", "\n", "```\n", "Title: Correlations of protein and gene expression in CPTAC \n", "Author: Boris Aguilar\n", "Created: 05-23-2021\n", "Purpose: Compute correlations between proteomic and gene expression available in the PDC \n", "Notes: Runs in Google Colab \n", "```\n", "This notebook uses BigQuery to compute Pearson correlation between protein and gene expression for all the genes in the BigQuery tables of the PDC dataset. We used CCRCC as example; but this can be changed easily for other cancer types." ] }, { "cell_type": "markdown", "metadata": { "id": "BCpeagmasFFs" }, "source": [ "## Modules" ] }, { "cell_type": "code", "metadata": { "id": "losf8GRlZvcM" }, "source": [ "from google.cloud import bigquery\n", "from google.colab import auth\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import pandas_gbq" ], "execution_count": 1, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "CnzNWzE3zS0H" }, "source": [ "## Google Authentication\n", "The first step is to authorize access to BigQuery and the Google Cloud. For more information see ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html) and alternative authentication methods can be found [here](https://googleapis.dev/python/google-api-core/latest/auth.html).\n", "\n", "Moreover you need to [create a google cloud](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console) project to be able to run BigQuery queries." ] }, { "cell_type": "code", "metadata": { "id": "2ySNqCskzONP" }, "source": [ "auth.authenticate_user()\n", "my_project_id = \"\" # write your project id here\n", "bqclient = bigquery.Client( my_project_id )" ], "execution_count": 2, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "v0y0FkrvBn4L" }, "source": [ "## Retrieve protein expression of CCRCC\n", "The following query will retrieve protein expression and case IDs from CPTAC table `quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current`. Moreover, to label samples as Tumor or Normal samples we join the table with metadata available in the table `aliquot_to_case_mapping_pdc_current` " ] }, { "cell_type": "code", "metadata": { "id": "E2b6HN_cu-cn" }, "source": [ "prot = '''quant AS (\n", " SELECT meta.sample_submitter_id, meta.sample_type, quant.case_id, quant.aliquot_id, quant.gene_symbol, \n", " CAST(quant.protein_abundance_log2ratio AS FLOAT64) AS protein_abundance_log2ratio \n", " FROM `isb-cgc-bq.CPTAC.quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current` as quant\n", " JOIN `isb-cgc-bq.PDC_metadata.aliquot_to_case_mapping_current` as meta\n", " ON quant.case_id = meta.case_id\n", " AND quant.aliquot_id = meta.aliquot_id\n", " AND meta.sample_type IN ('Primary Tumor','Solid Tissue Normal')\n", ")'''" ], "execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "G_TjJAtgvtA2" }, "source": [ "## Retrieve gene expression of CCRCC\n", "Next we retrieve gene expression data from the table `CPTAC.RNAseq_hg38_gdc_current` which contains RNA-seq data from all tumor types of CPTAC. Moreover we join the data with the metadata table `aliquot_to_case_mapping_pdc_current` to label samples to cancer or normal tissue" ] }, { "cell_type": "code", "metadata": { "id": "eAaHSre1v2cV" }, "source": [ "gexp = '''gexp AS (\n", " SELECT DISTINCT meta.sample_submitter_id, meta.sample_type, rnaseq.gene_name , LOG(rnaseq.HTSeq__FPKM + 1) as HTSeq__FPKM \n", " FROM `isb-cgc-bq.CPTAC.RNAseq_hg38_gdc_current` as rnaseq\n", " JOIN `isb-cgc-bq.PDC_metadata.aliquot_to_case_mapping_current` as meta\n", " ON meta.sample_submitter_id = rnaseq.sample_barcode\n", ")'''" ], "execution_count": 4, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "4IW69UHBwu2s" }, "source": [ "## Compute Pearson correlation\n", "The following query join the protein and gene expression data and compute correlation for each gene and semple type (normal or tumor)." ] }, { "cell_type": "code", "metadata": { "id": "kDSJ47hbw28a" }, "source": [ "corr = '''correlation AS (\n", " SELECT quant.gene_symbol, gexp.sample_type, COUNT(*) as n, CORR(protein_abundance_log2ratio,HTSeq__FPKM) as corr \n", " FROM quant JOIN gexp \n", " ON quant.sample_submitter_id = gexp.sample_submitter_id\n", " AND gexp.gene_name = quant.gene_symbol\n", " AND gexp.sample_type = quant.sample_type\n", " GROUP BY quant.gene_symbol, gexp.sample_type\n", ")'''" ], "execution_count": 5, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "TKStdGWhxsYQ" }, "source": [ "## Compute p-values " ] }, { "cell_type": "code", "metadata": { "id": "g3QdUOQqxz3X" }, "source": [ "pval = '''SELECT gene_symbol, sample_type, n, corr,\n", " `cgc-05-0042.functions.corr_pvalue`(corr, n) as p\n", "FROM correlation\n", "WHERE ABS(corr) <= 1.0'''" ], "execution_count": 6, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "wGjeJAGuyqbA" }, "source": [ "## Adjust p-values\n", "The following commands generate the final query which will be sent to Google to retrieve the final data that include the correlation for each gene. The query also includes a function (BHmultipletests) that adjusts the computed p values with the Benjamini-Hochberg method for multipletest correction." ] }, { "cell_type": "code", "metadata": { "id": "woduO0K9ywOM" }, "source": [ "mysql = '''DECLARE Nrows INT64;\n", "CREATE TEMP TABLE PearsonCorrelation AS\n", "WITH {0}, \n", "{1}, \n", "{2} \n", "{3}\n", ";\n", "# Adjust pvalues for multiple tests\n", "SET Nrows = ( SELECT COUNT(*) FROM PearsonCorrelation );\n", "CALL `cgc-05-0042.functions.BHmultipletests`( 'PearsonCorrelation', 'p', Nrows )\n", "'''.format(prot, gexp, corr, pval)" ], "execution_count": 7, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "BcNc2B_hMmfn" }, "source": [ "## Run the query to retrieve the analysis " ] }, { "cell_type": "code", "metadata": { "id": "0RzNRC16Mv11" }, "source": [ "job_config = bigquery.QueryJobConfig()\n", "job_config.use_legacy_sql = False\n", "try:\n", " query_job = bqclient.query ( mysql, job_config=job_config )\n", "except:\n", " print ( \" FATAL ERROR: query execution failed \" )\n", "mydf = query_job.to_dataframe()" ], "execution_count": 8, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "WUIdZMflFgTX" }, "source": [ "The following command displays the results." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 419 }, "id": "9840zbeiFMhG", "outputId": "805f0bb8-4f05-4968-a190-b1f724c5d130" }, "source": [ "mydf" ], "execution_count": 9, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", " | gene_symbol | \n", "sample_type | \n", "n | \n", "corr | \n", "p | \n", "p_adj | \n", "
---|---|---|---|---|---|---|
0 | \n", "ANXA13 | \n", "Primary Tumor | \n", "100 | \n", "0.955971 | \n", "1.205495e-53 | \n", "1.767496e-49 | \n", "
1 | \n", "MYO1B | \n", "Primary Tumor | \n", "100 | \n", "0.952969 | \n", "2.759249e-52 | \n", "2.022806e-48 | \n", "
2 | \n", "CES1 | \n", "Primary Tumor | \n", "100 | \n", "0.949441 | \n", "8.494844e-51 | \n", "4.151714e-47 | \n", "
3 | \n", "PHYHIPL | \n", "Primary Tumor | \n", "100 | \n", "0.942866 | \n", "2.748145e-48 | \n", "1.007332e-44 | \n", "
4 | \n", "LGALS3 | \n", "Primary Tumor | \n", "100 | \n", "0.941492 | \n", "8.429648e-48 | \n", "2.471910e-44 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
14657 | \n", "AHSA1 | \n", "Solid Tissue Normal | \n", "75 | \n", "0.000278 | \n", "9.981109e-01 | \n", "9.983833e-01 | \n", "
14658 | \n", "CIC | \n", "Primary Tumor | \n", "100 | \n", "-0.000204 | \n", "9.983915e-01 | \n", "9.985958e-01 | \n", "
14659 | \n", "SARM1 | \n", "Solid Tissue Normal | \n", "75 | \n", "0.000200 | \n", "9.986388e-01 | \n", "9.987750e-01 | \n", "
14660 | \n", "C7orf26 | \n", "Solid Tissue Normal | \n", "75 | \n", "0.000131 | \n", "9.991117e-01 | \n", "9.991798e-01 | \n", "
14661 | \n", "TFB1M | \n", "Solid Tissue Normal | \n", "75 | \n", "-0.000006 | \n", "9.999588e-01 | \n", "9.999588e-01 | \n", "
14662 rows × 6 columns
\n", "