{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "How_to_explore_CPTAC_protein_abundances.ipynb", "provenance": [], "include_colab_link": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "m_hO7ZkoX5yh" }, "source": [ "# Example notebook exploring CPTAC protein abundances\n" ] }, { "cell_type": "markdown", "metadata": { "id": "W2vs5M7OTvcx" }, "source": [ "Check out more notebooks at our [Community Notebooks Repository!](https://github.com/isb-cgc/Community-Notebooks)\n", "\n", "```\n", "Title: Example notebook exploring CPTAC protein abundances\n", "Author: Boris Aguilar\n", "Created: 01-19-2021\n", "Purpose: Retrieve and analyze protein abundances from CPTAC\n", "Notes: This notebook recapitulates the following notebook https://pdc.cancer.gov/API_documentation/PDC_clustergram.html \n", "```\n", "The notebook extracts protein abundances from the CPTAC clear cell renal cell carcinoma (CCRCC) discovery study, together with their associated clinical metadata, from the publicly available BigQuery tables that the ISB-CGC project has produced from CPTAC data. 
\n", "Finally, the notebook clusters and visualizes the data using Seaborn's clustermap function.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lspHBOL-X5y4" }, "source": [ "## Modules" ] }, { "cell_type": "code", "metadata": { "id": "j6tM4e2EX5y5" }, "source": [ "from google.cloud import bigquery\n", "from google.colab import auth\n", "import seaborn as se\n", "import pandas as pd\n", "import pandas_gbq\n", "import matplotlib.pyplot as plt\n", "from scipy.stats import zscore" ], "execution_count": 56, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "yQUsmaJNJXso" }, "source": [ "## Defining helper functions" ] }, { "cell_type": "code", "metadata": { "id": "pxl-aA71xTQK" }, "source": [ "# A color mapping function for the clinical annotations:\n", "# each unique value in column `name` gets its own shade of `color`\n", "def get_colors(df, name, color) -> pd.Series:\n", "    s = pd.Series(df[name])\n", "    su = s.unique()\n", "    # One light-palette shade per unique annotation value\n", "    colors = se.light_palette(color, len(su))\n", "    lut = dict(zip(su, colors))\n", "    return s.map(lut)" ], "execution_count": 48, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "MeCSS-zMNSuH" }, "source": [ "## Google Authentication\n", "The first step is to authorize access to BigQuery and Google Cloud. For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html); alternative authentication methods are described [here](https://googleapis.dev/python/google-api-core/latest/auth.html).\n", "\n", "You also need to [create a Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console) to be able to run BigQuery queries." 
] }, { "cell_type": "code", "metadata": { "id": "tRwRAo1QbV7f" }, "source": [ "auth.authenticate_user()\n", "my_project_id = \"\" # write your project id here" ], "execution_count": 18, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "pY7u6evoX5zS" }, "source": [ "## Fetch the data\n", "The following code obtains protein abundances and clinical metadata for all the cases in the CPTAC CCRCC study. Specifically, we join two tables, quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current and clinical_CPTAC3_discovery_pdc_current, which host protein abundances and clinical metadata, respectively. \n", "\n", "The result of the query is automatically stored in a pandas DataFrame (quant_data) by the function read_gbq. " ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 436 }, "id": "dW5lRpCzX5zS", "outputId": "cca061a4-38b9-4c1b-fb68-fcca069e3d7b" }, "source": [ "sql = '''\n", "SELECT pg.aliquot_submitter_id, pg.gene_symbol, \n", " CAST(pg.protein_abundance_log2ratio as FLOAT64) as log2ratio,\n", " clin.tumor_stage, clin.primary_diagnosis \n", "FROM `isb-cgc-bq.CPTAC.quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current` as pg\n", "JOIN `isb-cgc-bq.CPTAC.clinical_CPTAC3_discovery_pdc_current` as clin\n", "ON pg.case_id = clin.case_id\n", "'''\n", "quant_data = pandas_gbq.read_gbq(sql, project_id=my_project_id)\n", "quant_data\n" ], "execution_count": 19, "outputs": [ { "output_type": "stream", "text": [ "Downloading: 100%|██████████| 1985337/1985337 [01:38<00:00, 20156.62rows/s]\n" ], "name": "stderr" }, { "output_type": "execute_result", "data": { "text/html": [ "
\n",
"| | aliquot_submitter_id | gene_symbol | log2ratio | tumor_stage | primary_diagnosis |\n",
"|---|---|---|---|---|---|\n",
"| 0 | NCI7-2 | COX1 | -0.2728 | None | None |\n",
"| 1 | QC5 | COX1 | -0.7462 | None | None |\n",
"| 2 | QC4 | COX1 | -0.8352 | None | None |\n",
"| 3 | QC7 | COX1 | -0.6299 | None | None |\n",
"| 4 | QC8 | COX1 | -0.9439 | None | None |\n",
"| ... | ... | ... | ... | ... | ... |\n",
"| 1985332 | CPT0024680001 | LZTR1 | 0.0672 | Stage III | Renal cell carcinoma, NOS |\n",
"| 1985333 | CPT0066430001 | LZTR1 | -0.2127 | Stage III | Renal cell carcinoma, NOS |\n",
"| 1985334 | CPT0009060003 | LZTR1 | -0.1471 | Stage III | Renal cell carcinoma, NOS |\n",
"| 1985335 | CPT0006730001 | LZTR1 | -0.0764 | Stage III | Renal cell carcinoma, NOS |\n",
"| 1985336 | CPT0024670003 | LZTR1 | -0.2035 | Stage III | Renal cell carcinoma, NOS |\n",
"\n",
"1985337 rows × 5 columns\n", "