{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Correlations_Protein_and_Gene_expression-CPTAC.ipynb",
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "9h1bEFTSoGyu"
},
"source": [
"# Compute correlations of protein and gene expression in CPTAC\n",
"\n",
"\n",
"```\n",
"Title: Correlations of protein and gene expression in CPTAC \n",
"Author: Boris Aguilar\n",
"Created: 05-23-2021\n",
"Purpose: Compute correlations between proteomic and gene expression available in the PDC \n",
"Notes: Runs in Google Colab \n",
"```\n",
"This notebook uses BigQuery to compute the Pearson correlation between protein and gene expression for all genes in the BigQuery tables of the PDC dataset. We use CCRCC as an example, but the queries can easily be adapted to other cancer types."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BCpeagmasFFs"
},
"source": [
"## Modules"
]
},
{
"cell_type": "code",
"metadata": {
"id": "losf8GRlZvcM"
},
"source": [
"from google.cloud import bigquery\n",
"from google.colab import auth\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import pandas_gbq"
],
"execution_count": 1,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "CnzNWzE3zS0H"
},
"source": [
"## Google Authentication\n",
"The first step is to authorize access to BigQuery and Google Cloud. For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html); alternative authentication methods are described [here](https://googleapis.dev/python/google-api-core/latest/auth.html).\n",
"\n",
"You also need to [create a Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console) to be able to run BigQuery queries."
]
},
{
"cell_type": "code",
"metadata": {
"id": "2ySNqCskzONP"
},
"source": [
"auth.authenticate_user()\n",
"my_project_id = \"\" # write your project id here\n",
"bqclient = bigquery.Client( my_project_id )"
],
"execution_count": 2,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "v0y0FkrvBn4L"
},
"source": [
"## Retrieve protein expression of CCRCC\n",
"The following query retrieves protein expression and case IDs from the CPTAC table `quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current`. To label samples as tumor or normal, we join this table with the metadata in the table `PDC_metadata.aliquot_to_case_mapping_current`."
]
},
{
"cell_type": "code",
"metadata": {
"id": "E2b6HN_cu-cn"
},
"source": [
"prot = '''quant AS (\n",
" SELECT meta.sample_submitter_id, meta.sample_type, quant.case_id, quant.aliquot_id, quant.gene_symbol, \n",
" CAST(quant.protein_abundance_log2ratio AS FLOAT64) AS protein_abundance_log2ratio \n",
" FROM `isb-cgc-bq.CPTAC.quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current` as quant\n",
" JOIN `isb-cgc-bq.PDC_metadata.aliquot_to_case_mapping_current` as meta\n",
" ON quant.case_id = meta.case_id\n",
" AND quant.aliquot_id = meta.aliquot_id\n",
" AND meta.sample_type IN ('Primary Tumor','Solid Tissue Normal')\n",
")'''"
],
"execution_count": 3,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "G_TjJAtgvtA2"
},
"source": [
"## Retrieve gene expression of CCRCC\n",
"Next we retrieve gene expression data from the table `CPTAC.RNAseq_hg38_gdc_current`, which contains RNA-seq data from all CPTAC tumor types. We again join the data with the metadata table `PDC_metadata.aliquot_to_case_mapping_current` to label samples as tumor or normal tissue. The expression values are transformed as `LOG(HTSeq__FPKM + 1)`."
]
},
{
"cell_type": "code",
"metadata": {
"id": "eAaHSre1v2cV"
},
"source": [
"gexp = '''gexp AS (\n",
" SELECT DISTINCT meta.sample_submitter_id, meta.sample_type, rnaseq.gene_name , LOG(rnaseq.HTSeq__FPKM + 1) as HTSeq__FPKM \n",
" FROM `isb-cgc-bq.CPTAC.RNAseq_hg38_gdc_current` as rnaseq\n",
" JOIN `isb-cgc-bq.PDC_metadata.aliquot_to_case_mapping_current` as meta\n",
" ON meta.sample_submitter_id = rnaseq.sample_barcode\n",
")'''"
],
"execution_count": 4,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "4IW69UHBwu2s"
},
"source": [
"## Compute Pearson correlation\n",
"The following query joins the protein and gene expression data and computes the correlation for each gene and sample type (normal or tumor)."
]
},
{
"cell_type": "code",
"metadata": {
"id": "kDSJ47hbw28a"
},
"source": [
"corr = '''correlation AS (\n",
" SELECT quant.gene_symbol, gexp.sample_type, COUNT(*) as n, CORR(protein_abundance_log2ratio,HTSeq__FPKM) as corr \n",
" FROM quant JOIN gexp \n",
" ON quant.sample_submitter_id = gexp.sample_submitter_id\n",
" AND gexp.gene_name = quant.gene_symbol\n",
" AND gexp.sample_type = quant.sample_type\n",
" GROUP BY quant.gene_symbol, gexp.sample_type\n",
")'''"
],
"execution_count": 5,
"outputs": []
},
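{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `CORR` with `GROUP BY` logic above can be sketched locally with pandas on a small made-up table. This is only an illustration under stated assumptions: the toy values and the names `toy` and `toy_corr` are hypothetical and are not part of the PDC tables."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: one row per (gene, sample) pair, as it would look after the join\n",
"toy = pd.DataFrame({\n",
"    'gene_symbol': ['A'] * 4 + ['B'] * 4,\n",
"    'sample_type': ['Primary Tumor'] * 8,\n",
"    'protein_abundance_log2ratio': [0.1, 0.4, 0.8, 1.1, 1.0, 0.2, 0.7, 0.3],\n",
"    'HTSeq__FPKM': [1.0, 2.0, 3.0, 4.0, 0.5, 2.5, 1.0, 2.0],\n",
"})\n",
"\n",
"# Sample count and Pearson correlation per gene and sample type\n",
"grouped = toy.groupby(['gene_symbol', 'sample_type'])\n",
"toy_corr = pd.DataFrame({\n",
"    'n': grouped.size(),\n",
"    'corr': grouped.apply(lambda g: g['protein_abundance_log2ratio'].corr(g['HTSeq__FPKM'])),\n",
"}).reset_index()\n",
"toy_corr"
],
"execution_count": null,
"outputs": []
},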
{
"cell_type": "markdown",
"metadata": {
"id": "TKStdGWhxsYQ"
},
"source": [
"## Compute p-values "
]
},
{
"cell_type": "code",
"metadata": {
"id": "g3QdUOQqxz3X"
},
"source": [
"pval = '''SELECT gene_symbol, sample_type, n, corr,\n",
" `cgc-05-0042.functions.corr_pvalue`(corr, n) as p\n",
"FROM correlation\n",
"WHERE ABS(corr) <= 1.0'''"
],
"execution_count": 6,
"outputs": []
},
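{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query above calls the UDF `corr_pvalue` without showing its definition. A common way to obtain a two-sided p-value for a Pearson correlation `r` computed from `n` samples is a t-test with `n - 2` degrees of freedom. The sketch below assumes that approach and may differ from what the UDF actually does; the helper name `pearson_pvalue` and the random data are hypothetical. Note that the formula is undefined for `|r| = 1`."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"def pearson_pvalue(r, n):\n",
"    # Two-sided p-value for a Pearson correlation r from n samples\n",
"    # (assumed method: t-test with n - 2 degrees of freedom)\n",
"    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))\n",
"    return 2.0 * stats.t.sf(np.abs(t), df=n - 2)\n",
"\n",
"# Cross-check against scipy.stats.pearsonr on made-up data\n",
"rng = np.random.default_rng(0)\n",
"x = rng.normal(size=50)\n",
"y = 0.5 * x + rng.normal(size=50)\n",
"r, p_scipy = stats.pearsonr(x, y)\n",
"print(r, pearson_pvalue(r, len(x)), p_scipy)"
],
"execution_count": null,
"outputs": []
},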
{
"cell_type": "markdown",
"metadata": {
"id": "wGjeJAGuyqbA"
},
"source": [
"## Adjust p-values\n",
"The following commands generate the final query, which is sent to BigQuery to retrieve the correlation for each gene. The query also calls a procedure (`BHmultipletests`) that adjusts the computed p-values with the Benjamini-Hochberg method for multiple-testing correction."
]
},
{
"cell_type": "code",
"metadata": {
"id": "woduO0K9ywOM"
},
"source": [
"mysql = '''DECLARE Nrows INT64;\n",
"CREATE TEMP TABLE PearsonCorrelation AS\n",
"WITH {0}, \n",
"{1}, \n",
"{2} \n",
"{3}\n",
";\n",
"# Adjust pvalues for multiple tests\n",
"SET Nrows = ( SELECT COUNT(*) FROM PearsonCorrelation );\n",
"CALL `cgc-05-0042.functions.BHmultipletests`( 'PearsonCorrelation', 'p', Nrows )\n",
"'''.format(prot, gexp, corr, pval)"
],
"execution_count": 7,
"outputs": []
},
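{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the Benjamini-Hochberg adjustment can be sketched in NumPy as follows. This is a generic textbook version, not the code of the `BHmultipletests` procedure; the function name `bh_adjust` and the example p-values are hypothetical."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"\n",
"def bh_adjust(pvalues):\n",
"    # Benjamini-Hochberg adjusted p-values, returned in the input order\n",
"    p = np.asarray(pvalues, dtype=float)\n",
"    m = p.size\n",
"    order = np.argsort(p)\n",
"    # Scale each sorted p-value by m / rank\n",
"    ranked = p[order] * m / np.arange(1, m + 1)\n",
"    # Enforce monotonicity, starting from the largest p-value\n",
"    ranked = np.minimum.accumulate(ranked[::-1])[::-1]\n",
"    adjusted = np.empty(m)\n",
"    adjusted[order] = np.minimum(ranked, 1.0)\n",
"    return adjusted\n",
"\n",
"bh_adjust([0.001, 0.04, 0.03, 0.2])"
],
"execution_count": null,
"outputs": []
},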
{
"cell_type": "markdown",
"metadata": {
"id": "BcNc2B_hMmfn"
},
"source": [
"## Run the query to retrieve the analysis "
]
},
{
"cell_type": "code",
"metadata": {
"id": "0RzNRC16Mv11"
},
"source": [
"job_config = bigquery.QueryJobConfig()\n",
"job_config.use_legacy_sql = False\n",
"try:\n",
"    query_job = bqclient.query(mysql, job_config=job_config)\n",
"    # to_dataframe() waits for the job to finish, so query errors surface here\n",
"    mydf = query_job.to_dataframe()\n",
"except Exception as err:\n",
"    print(\"FATAL ERROR: query execution failed:\", err)\n",
"    raise"
],
"execution_count": 8,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "WUIdZMflFgTX"
},
"source": [
"The following command displays the results."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 419
},
"id": "9840zbeiFMhG",
"outputId": "805f0bb8-4f05-4968-a190-b1f724c5d130"
},
"source": [
"mydf"
],
"execution_count": 9,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"| | gene_symbol | sample_type | n | corr | p | p_adj |\n",
"|---|---|---|---|---|---|---|\n",
"| 0 | ANXA13 | Primary Tumor | 100 | 0.955971 | 1.205495e-53 | 1.767496e-49 |\n",
"| 1 | MYO1B | Primary Tumor | 100 | 0.952969 | 2.759249e-52 | 2.022806e-48 |\n",
"| 2 | CES1 | Primary Tumor | 100 | 0.949441 | 8.494844e-51 | 4.151714e-47 |\n",
"| 3 | PHYHIPL | Primary Tumor | 100 | 0.942866 | 2.748145e-48 | 1.007332e-44 |\n",
"| 4 | LGALS3 | Primary Tumor | 100 | 0.941492 | 8.429648e-48 | 2.471910e-44 |\n",
"| ... | ... | ... | ... | ... | ... | ... |\n",
"| 14657 | AHSA1 | Solid Tissue Normal | 75 | 0.000278 | 9.981109e-01 | 9.983833e-01 |\n",
"| 14658 | CIC | Primary Tumor | 100 | -0.000204 | 9.983915e-01 | 9.985958e-01 |\n",
"| 14659 | SARM1 | Solid Tissue Normal | 75 | 0.000200 | 9.986388e-01 | 9.987750e-01 |\n",
"| 14660 | C7orf26 | Solid Tissue Normal | 75 | 0.000131 | 9.991117e-01 | 9.991798e-01 |\n",
"| 14661 | TFB1M | Solid Tissue Normal | 75 | -0.000006 | 9.999588e-01 | 9.999588e-01 |\n",
"\n",
"14662 rows × 6 columns
\n", "