{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "How_to_compare_protein_and_gene_expression-CPTAC.ipynb", "provenance": [], "authorship_tag": "ABX9TyNyW9h7pqQqfltHky0wxLAF", "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "9h1bEFTSoGyu" }, "source": [ "# How to compare protein and gene expression in the CPTAC dataset\n", "\n", "```\n", "Title: How to compare protein and gene expression in the CPTAC dataset\n", "Author: Boris Aguilar\n", "Created: 06-21-2021\n", "Purpose: Retrieve protein and gene expression from the CPTAC BigQuery tables and visualize the data \n", "Notes: Runs in Google Colab \n", "```\n", "This notebook uses BigQuery to retrieve protein and gene expression for a given gene in the CPTAC dataset. Scatter plots are then generated to compare the two data types.\n", "\n", "This example uses clear cell renal cell carcinoma (CCRCC). Other cancer types can be used by changing the protein table name (see Parameters)." ] }, { "cell_type": "markdown", "metadata": { "id": "wGLv8IvUeUxZ" }, "source": [ "## Modules" ] }, { "cell_type": "code", "metadata": { "id": "cTzJM0NNZyp1" }, "source": [ "from google.cloud import bigquery\n", "from google.colab import auth\n", "import pandas as pd\n", "import seaborn as sns\n", "import pandas_gbq" ], "execution_count": 20, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "zh3KALqEeh52" }, "source": [ "## Google Authentication\n", "The first step is to authorize access to BigQuery and Google Cloud. 
For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html); alternative authentication methods are described [here](https://googleapis.dev/python/google-api-core/latest/auth.html).\n", "\n", "You also need to [create a Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console) to run BigQuery queries." ] }, { "cell_type": "code", "metadata": { "id": "_k7uYY2yemqO" }, "source": [ "auth.authenticate_user()\n", "my_project_id = \"\" # write your project id here\n", "bqclient = bigquery.Client(project=my_project_id)" ], "execution_count": 21, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hblgqohrk-9x" }, "source": [ "## Parameters\n", "For this experiment we need to set the gene name, the name of the table with protein expression data for CCRCC (change this table name to analyze another cancer type), and the name of the table with gene expression data." ] }, { "cell_type": "code", "metadata": { "id": "trpM6GBelCDy" }, "source": [ "gene_name = 'RAB5A'\n", "protein_table = 'isb-cgc-bq.CPTAC_versioned.quant_proteome_CPTAC_CCRCC_discovery_study_pdc_V1_21'\n", "gexp_table = 'isb-cgc-bq.CPTAC_versioned.RNAseq_hg38_gdc_r28'" ], "execution_count": 12, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ytrC8uEJfBm0" }, "source": [ "## Retrieve protein expression of CCRCC\n", "\n", "The following query fragment (a common table expression) retrieves protein abundance, case IDs, and aliquot IDs for the selected gene from the given CPTAC table. To label each sample as tumor or normal, it joins the protein table with the metadata table `PDC_metadata_versioned.aliquot_to_case_mapping_V1_21`." 
] }, { "cell_type": "code", "metadata": { "id": "8q1ABkJgjiye" }, "source": [ "prot = '''quant AS (\n", " SELECT meta.sample_submitter_id, meta.sample_type, quant.case_id, quant.aliquot_id, quant.gene_symbol, \n", " CAST(quant.protein_abundance_log2ratio AS FLOAT64) AS protein_abundance_log2ratio \n", " FROM `{0}` as quant\n", " JOIN `isb-cgc-bq.PDC_metadata_versioned.aliquot_to_case_mapping_V1_21` as meta\n", " ON quant.case_id = meta.case_id\n", " AND quant.aliquot_id = meta.aliquot_id\n", " AND quant.gene_symbol = '{1}'\n", ")\n", "'''.format(protein_table, gene_name)" ], "execution_count": 13, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "XN8WcXVNj837" }, "source": [ "## Retrieve gene expression of CCRCC\n", "Next we retrieve gene expression data from the table `CPTAC_versioned.RNAseq_hg38_gdc_r28`, which contains RNA-seq data from all CPTAC tumor types. The query applies a log(FPKM + 1) transform to the expression values. We also join the data with the metadata table `PDC_metadata.aliquot_to_case_mapping_current` to label samples as tumor or normal tissue." 
] }, { "cell_type": "code", "metadata": { "id": "erNEYfh-j-kZ" }, "source": [ "gexp = '''gexp AS (\n", " SELECT DISTINCT meta.sample_submitter_id, meta.sample_type, rnaseq.gene_name , LOG(rnaseq.HTSeq__FPKM + 1) as HTSeq__FPKM \n", " FROM `{0}` as rnaseq\n", " JOIN `isb-cgc-bq.PDC_metadata.aliquot_to_case_mapping_current` as meta\n", " ON meta.sample_submitter_id = rnaseq.sample_barcode\n", " AND rnaseq.gene_name = '{1}'\n", ")\n", "'''.format(gexp_table, gene_name)" ], "execution_count": 14, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "OW02y48ilrmm" }, "source": [ "## Run the query to retrieve the data" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 376 }, "id": "f7blfiIAlqLe", "outputId": "8ff33d6c-01b2-4c1a-cac1-92bde450e6dc" }, "source": [ "mysql = ( 'WITH ' + prot + ',' + gexp + \n", "'''\n", "SELECT quant.sample_submitter_id, quant.sample_type, quant.gene_symbol, \n", " quant.protein_abundance_log2ratio, gexp.HTSeq__FPKM\n", "FROM quant\n", "JOIN gexp \n", "ON gexp.sample_submitter_id = quant.sample_submitter_id\n", "''' )\n", "\n", "df2 = pandas_gbq.read_gbq(mysql,project_id=my_project_id )\n", "df2[0:10]" ], "execution_count": 19, "outputs": [ { "output_type": "stream", "text": [ "Downloading: 100%|██████████| 175/175 [00:00<00:00, 847.19rows/s]\n" ], "name": "stderr" }, { "output_type": "execute_result", "data": { "text/html": [ "
\n", "| | sample_submitter_id | sample_type | gene_symbol | protein_abundance_log2ratio | HTSeq__FPKM |\n",
"|---|---|---|---|---|---|\n",
"| 0 | C3N-00494-05 | Solid Tissue Normal | RAB5A | -0.0237 | 2.854013 |\n",
"| 1 | C3L-00011-01 | Primary Tumor | RAB5A | -0.0956 | 2.768813 |\n",
"| 2 | C3N-01261-06 | Solid Tissue Normal | RAB5A | 0.3311 | 2.779552 |\n",
"| 3 | C3L-01287-02 | Primary Tumor | RAB5A | -0.1559 | 2.621475 |\n",
"| 4 | C3N-00852-06 | Solid Tissue Normal | RAB5A | -0.1130 | 2.932494 |\n",
"| 5 | C3L-01607-06 | Solid Tissue Normal | RAB5A | -0.3805 | 2.866062 |\n",
"| 6 | C3L-01836-02 | Primary Tumor | RAB5A | -0.4295 | 2.404121 |\n",
"| 7 | C3L-01302-03 | Primary Tumor | RAB5A | -0.3248 | 2.495302 |\n",
"| 8 | C3N-00244-06 | Solid Tissue Normal | RAB5A | 0.3032 | 2.679795 |\n",
"| 9 | C3L-00907-06 | Solid Tissue Normal | RAB5A | 0.1733 | 2.773120 |\n", "