{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "How_to_explore_CPTAC_protein_abundances.ipynb",
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "m_hO7ZkoX5yh"
},
"source": [
"# Example notebook exploring CPTAC protein abundances\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "W2vs5M7OTvcx"
},
"source": [
"Check out more notebooks at our [Community Notebooks Repository!](https://github.com/isb-cgc/Community-Notebooks)\n",
"\n",
"```\n",
"Title: Example notebook exploring CPTAC protein abundances\n",
"Author: Boris Aguilar\n",
"Created: 01-19-2021\n",
"Purpose: Retrieve and analyze protein abundances from CPTAC\n",
"Notes: This notebook recapitulates the following notebook https://pdc.cancer.gov/API_documentation/PDC_clustergram.html \n",
"```\n",
"This notebook extracts protein abundances from the CPTAC clear cell renal cell carcinoma (CCRCC) discovery study, together with the associated clinical metadata, from the publicly available BigQuery tables that the ISB-CGC project has produced from CPTAC data.\n",
"It then clusters and visualizes the data with Seaborn's `clustermap` function.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lspHBOL-X5y4"
},
"source": [
"## Modules"
]
},
{
"cell_type": "code",
"metadata": {
"id": "j6tM4e2EX5y5"
},
"source": [
"from google.cloud import bigquery\n",
"from google.colab import auth\n",
"import seaborn as se\n",
"import pandas as pd\n",
"import pandas_gbq\n",
"import matplotlib.pyplot as plt\n",
"from scipy.stats import zscore"
],
"execution_count": 56,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "yQUsmaJNJXso"
},
"source": [
"## Defining helper functions"
]
},
{
"cell_type": "code",
"metadata": {
"id": "pxl-aA71xTQK"
},
"source": [
"# Map each distinct value of a clinical annotation column to a shade of `color`\n",
"def get_colors(df, name, color) -> pd.Series:\n",
"    s = pd.Series(df[name])\n",
"    su = s.unique()  # distinct annotation values\n",
"    colors = se.light_palette(color, len(su))  # one shade per distinct value\n",
"    lut = dict(zip(su, colors))  # lookup table: annotation value -> RGB color\n",
"    return s.map(lut)"
],
"execution_count": 48,
"outputs": []
},
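{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the helper above, the sketch below applies `get_colors` to a small, made-up annotation table (the `toy` DataFrame and its sample names are hypothetical, not CPTAC data). The function is repeated so that the snippet runs on its own. Each distinct value should receive its own shade, and samples sharing a value should share a color.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import seaborn as se  # same alias as in the Modules cell\n",
"\n",
"def get_colors(df, name, color) -> pd.Series:\n",
"    s = pd.Series(df[name])\n",
"    su = s.unique()\n",
"    colors = se.light_palette(color, len(su))\n",
"    lut = dict(zip(su, colors))\n",
"    return s.map(lut)\n",
"\n",
"# Hypothetical annotations; the real values come from the clinical table\n",
"toy = pd.DataFrame(\n",
"    {'tumor_stage': ['Stage I', 'Stage III', 'Stage I']},\n",
"    index=['sample_a', 'sample_b', 'sample_c'],\n",
")\n",
"\n",
"# One RGB tuple per sample, indexed like the input DataFrame\n",
"stage_colors = get_colors(toy, 'tumor_stage', 'red')\n",
"```\n",
"\n",
"A Series like `stage_colors` is the kind of input that Seaborn's `clustermap` accepts through its `row_colors` or `col_colors` arguments."
]
},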
{
"cell_type": "markdown",
"metadata": {
"id": "MeCSS-zMNSuH"
},
"source": [
"## Google Authentication\n",
"The first step is to authorize access to BigQuery and Google Cloud. For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html); alternative authentication methods are described [here](https://googleapis.dev/python/google-api-core/latest/auth.html).\n",
"\n",
"You also need to [create a Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console) to run BigQuery queries."
]
},
{
"cell_type": "code",
"metadata": {
"id": "tRwRAo1QbV7f"
},
"source": [
"auth.authenticate_user()\n",
"my_project_id = \"\" # write your project id here"
],
"execution_count": 18,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "pY7u6evoX5zS"
},
"source": [
"## Fetch the data\n",
"The following code obtains protein abundances and clinical metadata for all the cases in the CPTAC CCRCC study. Specifically, we join two tables, `quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current` and `clinical_CPTAC3_discovery_pdc_current`, which host the protein abundances and the clinical metadata, respectively.\n",
"\n",
"The function `read_gbq` automatically stores the query results in a pandas DataFrame (`quant_data`)."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 436
},
"id": "dW5lRpCzX5zS",
"outputId": "cca061a4-38b9-4c1b-fb68-fcca069e3d7b"
},
"source": [
"sql = '''\n",
"SELECT pg.aliquot_submitter_id, pg.gene_symbol, \n",
" CAST(pg.protein_abundance_log2ratio as FLOAT64) as log2ratio,\n",
" clin.tumor_stage, clin.primary_diagnosis \n",
"FROM `isb-cgc-bq.CPTAC.quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current` as pg\n",
"JOIN `isb-cgc-bq.CPTAC.clinical_CPTAC3_discovery_pdc_current` as clin\n",
"ON pg.case_id = clin.case_id\n",
"'''\n",
"quant_data = pandas_gbq.read_gbq(sql, project_id=my_project_id)\n",
"quant_data\n"
],
"execution_count": 19,
"outputs": [
{
"output_type": "stream",
"text": [
"Downloading: 100%|██████████| 1985337/1985337 [01:38<00:00, 20156.62rows/s]\n"
],
"name": "stderr"
},
{
"output_type": "execute_result",
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>aliquot_submitter_id</th><th>gene_symbol</th><th>log2ratio</th><th>tumor_stage</th><th>primary_diagnosis</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>NCI7-2</td><td>COX1</td><td>-0.2728</td><td>None</td><td>None</td></tr>\n",
"    <tr><th>1</th><td>QC5</td><td>COX1</td><td>-0.7462</td><td>None</td><td>None</td></tr>\n",
"    <tr><th>2</th><td>QC4</td><td>COX1</td><td>-0.8352</td><td>None</td><td>None</td></tr>\n",
"    <tr><th>3</th><td>QC7</td><td>COX1</td><td>-0.6299</td><td>None</td><td>None</td></tr>\n",
"    <tr><th>4</th><td>QC8</td><td>COX1</td><td>-0.9439</td><td>None</td><td>None</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>1985332</th><td>CPT0024680001</td><td>LZTR1</td><td>0.0672</td><td>Stage III</td><td>Renal cell carcinoma, NOS</td></tr>\n",
"    <tr><th>1985333</th><td>CPT0066430001</td><td>LZTR1</td><td>-0.2127</td><td>Stage III</td><td>Renal cell carcinoma, NOS</td></tr>\n",
"    <tr><th>1985334</th><td>CPT0009060003</td><td>LZTR1</td><td>-0.1471</td><td>Stage III</td><td>Renal cell carcinoma, NOS</td></tr>\n",
"    <tr><th>1985335</th><td>CPT0006730001</td><td>LZTR1</td><td>-0.0764</td><td>Stage III</td><td>Renal cell carcinoma, NOS</td></tr>\n",
"    <tr><th>1985336</th><td>CPT0024670003</td><td>LZTR1</td><td>-0.2035</td><td>Stage III</td><td>Renal cell carcinoma, NOS</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>1985337 rows × 5 columns</p>\n", "