{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3" }, "colab": { "name": "ACM_BCB_2020_POSTER_KruskalWallisTest_ProteinGeneExpression_vs_ClinicalFeatures.ipynb", "provenance": [], "collapsed_sections": [] } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "SJcF-vtU7rpj", "colab_type": "text" }, "source": [ "# Poster Notebook: Kruskal Wallis test for associations between Protein/Gene Expression and Clinical Features\n", "```\n", "Created: 09-20-2020\n", "URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/FeaturedNotebooks/ACM_BCB_2020_POSTER_KruskalWallisTest_ProteinGeneExpression_vs_ClinicalFeatures.ipynb\n", "Note: This notebook supports the poster \"Multi-omics Data Integration in the Cloud: Analysis \n", "of Statistically Significant Associations Between Clinical and Molecular Features in Breast Cancer\" \n", "by K. Abdilleh, B. Aguilar, and R. Thomson, presented at the ACM Conference on Bioinformatics, \n", "Computational Biology, and Health Informatics, 2020.\n", "```\n", "***\n", "\n", "This notebook computes statistically significant associations between protein/gene expression and clinical features of breast cancer, using data available in TCGA BigQuery tables.\n", "\n", "The associations were computed using the Kruskal Wallis (KW) test, implemented as a user-defined function in BigQuery. Details of the KW test and its implementation can be found at: https://github.com/jrossthomson/bigquery-utils/tree/master/udfs/statslib\n", "\n", "Violin plots are presented for the clinical features with the most significant associations with protein and gene expression."
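] }, { "cell_type": "markdown", "metadata": { "colab_type": "text" }, "source": [ "For readers who want to see what the statistic looks like outside BigQuery, the following is a minimal pure-Python sketch of the textbook Kruskal Wallis H statistic, with average ranks for ties and the standard tie correction. It is not the UDF's code: the function name `kruskal_wallis_h` and the (group label, value) input layout are assumptions chosen for illustration, and the p-value (a chi-square tail probability with the returned degrees of freedom) is left out.

```python
from collections import defaultdict


def kruskal_wallis_h(pairs):
    '''Return (H, dof) for (group_label, value) pairs, using tie-averaged ranks.

    Hypothetical helper for illustration; assumes at least two distinct values.
    '''
    ordered = sorted(pairs, key=lambda pair: pair[1])
    n_total = len(ordered)
    rank_sums = defaultdict(float)
    group_sizes = defaultdict(int)
    tie_term = 0.0
    i = 0
    while i < n_total:
        # Find the run of equal values starting at position i.
        j = i
        while j < n_total and ordered[j][1] == ordered[i][1]:
            j += 1
        # Ranks are 1-based; tied values share the average rank of their run.
        avg_rank = (i + 1 + j) / 2.0
        for label, _ in ordered[i:j]:
            rank_sums[label] += avg_rank
            group_sizes[label] += 1
        run = j - i
        tie_term += run ** 3 - run
        i = j
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        rank_sums[g] ** 2 / group_sizes[g] for g in rank_sums
    ) - 3.0 * (n_total + 1)
    # Standard tie correction; it equals 1 when there are no ties.
    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    return h / correction, len(rank_sums) - 1
```

For three groups holding the values 1 to 3, 4 to 6, and 7 to 9, `kruskal_wallis_h` returns H of about 7.2 with 2 degrees of freedom. If every value is identical, the tie correction is zero and the final division fails, so such input needs to be filtered out first."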
] }, { "cell_type": "markdown", "metadata": { "id": "X53EqHduuXEf", "colab_type": "text" }, "source": [ "# Setup" ] }, { "cell_type": "code", "metadata": { "id": "TppLwn_uF4Y1", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "outputId": "56c2d50d-09c0-4643-841d-dcfc8ac2d005" }, "source": [ "import sys\n", "from platform import python_version\n", "\n", "# Uncomment to install the plotting and Google Cloud client libraries.\n", "#! {sys.executable} -m pip install matplotlib seaborn\n", "#! {sys.executable} -m pip install google-cloud\n", "#! {sys.executable} -m pip install google-auth\n", "\n", "# Show the interpreter path and Python version used by this runtime.\n", "print(sys.executable)\n", "print(python_version())" ], "execution_count": 7, "outputs": [ { "output_type": "stream", "text": [ "/usr/bin/python3\n", "3.6.9\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "_OeROlcGWi5-", "colab_type": "text" }, "source": [ "# Authentication\n" ] }, { "cell_type": "code", "metadata": { "id": "b-debebxHIWw", "colab_type": "code", "colab": {} }, "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import pandas_gbq\n", "from google.colab import auth\n", "import google.auth\n", "\n", "auth.authenticate_user()\n", "# Explicitly create a credentials object.
This allows you to use the same\n", "# credentials for both the BigQuery and BigQuery Storage clients, avoiding\n", "# unnecessary API calls to fetch duplicate authentication tokens.\n", "\n", "credentials, your_project_id = google.auth.default(\n", " scopes=[\"https://www.googleapis.com/auth/cloud-platform\"]\n", ")\n", "\n" ], "execution_count": 8, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "uPBtiq2QXAu6", "colab_type": "text" }, "source": [ "# Run Kruskal Wallis test over Protein Expression, Gene Expression, and Clinical Data\n", "The code below uses the following three tables available in ISB-CGC:\n", "- Protein expression table: `isb-cgc.TCGA_hg19_data_v0.Protein_Expression`\n", "- Gene expression table: `isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM`\n", "- Clinical data: `isb-cgc-bq.supplementary_tables.Abdilleh_etal_ACM_BCB_2020_TCGA_bioclin_v0_Clinical_UNPIVOT`\n", "\n", "The Kruskal Wallis test is implemented in a user-defined function called `isb-cgc-bq.functions.kruskal_wallis_current`.\n", "\n", "The code below uses KW to compute significant (p-value < 0.001) associations between protein expression and clinical features in breast cancer patients.
The output (protein names) is then used in a second KW test that identifies significant associations between clinical features and both gene and protein expression.\n", "\n", "The final output is a table with proteins/genes, clinical features, and the p-values of the Kruskal Wallis tests.\n" ] }, { "cell_type": "code", "metadata": { "id": "-0m6TjJtF4Y9", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 419 }, "outputId": "bdc6b2fa-f045-4313-9cf6-57303a722707" }, "source": [ "cancer_type = 'TCGA-BRCA' # https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations\n", "significance_level = '0.001'\n", "project_id = \"\" # write your project id here\n", "sql = '''\n", "with the_proteins as (\n", " SELECT p.project_short_name as study, gene_name as g, c.feature.key as c, `isb-cgc-bq.functions.kruskal_wallis_current`(array_agg((c.feature.value,protein_expression))) as reso\n", " FROM `isb-cgc.TCGA_hg19_data_v0.Protein_Expression` p\n", " JOIN `isb-cgc-bq.supplementary_tables.Abdilleh_etal_ACM_BCB_2020_TCGA_bioclin_v0_Clinical_UNPIVOT` c\n", " ON c.case_barcode = substr(p.sample_barcode,0,12)\n", " WHERE 1=1 AND c.feature.value != \"null\" AND p.project_short_name = \"{0}\"\n", " GROUP BY study, g, c\n", " HAVING reso.DoF > 1 and reso.DoF < 10 #and reso.p <= {1}\n", " ORDER BY study, reso.p, c\n", ") # the_proteins\n", ",\n", "the_goods as (\n", " SELECT HGNC_gene_symbol as g, c.feature.key as c, `isb-cgc-bq.functions.kruskal_wallis_current`(array_agg((c.feature.value,normalized_count))) as reso\n", " FROM `isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM` p\n", " JOIN `isb-cgc-bq.supplementary_tables.Abdilleh_etal_ACM_BCB_2020_TCGA_bioclin_v0_Clinical_UNPIVOT` c\n", " ON c.case_barcode = substr(p.sample_barcode,0,12)\n", " where 1=1\n", " and c.feature.value != \"null\"\n", " and HGNC_gene_symbol in ( SELECT gene_name FROM `isb-cgc.TCGA_hg19_data_v0.Protein_Expression` GROUP BY 1 )\n", " and p.project_short_name = 
\"{0}\"\n", " GROUP BY g, c\n", " HAVING reso.DoF > 1 and reso.DoF < 10 #and reso.p <= {1}\n", " ORDER BY reso.p \n", ") # the_goods\n", "select pr.g , pr.c, pr.reso.p as p_protein, ge.reso.p as p_gexp\n", "from the_proteins pr\n", "join the_goods ge\n", "on ge.g = pr.g and ge.c = pr.c \n", "where pr.reso.p < {1} and ge.reso.p < {1} \n", "ORDER BY p_protein ASC, p_gexp DESC\n", "'''.format( cancer_type, significance_level )\n", "df = pandas_gbq.read_gbq(sql, project_id=project_id)\n", "df" ], "execution_count": 9, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
| | g | c | p_protein | p_gexp |\n",
"|---|---|---|---|---|\n",
"| 0 | CDH1 | histological_type | 0.000000e+00 | 0.000000e+00 |\n",
"| 1 | ESR1 | histological_type | 9.214851e-15 | 3.888400e-10 |\n",
"| 2 | RPS6 | histological_type | 5.359047e-13 | 5.574927e-06 |\n",
"| 3 | SLC1A5 | histological_type | 1.292966e-12 | 1.439255e-04 |\n",
"| 4 | ASNS | race | 2.190137e-12 | 2.731149e-14 |\n",
"| ... | ... | ... | ... | ... |\n",
"| 71 | BCL2A1 | histological_type | 6.214180e-04 | 6.058064e-04 |\n",
"| 72 | CDH3 | histological_type | 7.010622e-04 | 3.432601e-08 |\n",
"| 73 | EIF4G1 | race | 7.433937e-04 | 7.275839e-05 |\n",
"| 74 | FN1 | histological_type | 9.041138e-04 | 8.573417e-06 |\n",
"| 75 | STAT3 | race | 9.066003e-04 | 1.864869e-09 |\n",
"\n",
"76 rows × 4 columns
\n", "