{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regulome Explorer Chi-square test for categorical features\n", "Check out more notebooks at our ['Regulome Explorer Repository'](https://github.com/isb-cgc/Community-Notebooks/tree/master/RegulomeExplorer)!\n", "\n", "In this notebook we describe how Regulome Explorer uses the Chi-square test to compute statistical associations between two categorical features. This test is used when at least one of the categorical features has more than two categories. Fisher's exact test is used for the special case in which both features have only two categories; it is described in another notebook.\n", "\n", "We will use clinical data and somatic mutations for this test; both of these features are available in BigQuery tables. Details of the Chi-square test can be found at the following link: https://en.wikipedia.org/wiki/Chi-squared_test\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Authenticate with Google (IMPORTANT)\n", "Our first step is to authenticate with Google -- you will need to be a member of a Google Cloud Platform (GCP) project, with authorization to run BigQuery jobs in order to run this notebook. 
If you don't have access to a GCP project, please contact the ISB-CGC team for help (www.isb-cgc.org)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Import Python libraries" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from google.cloud import bigquery\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import stats\n", "import seaborn as sns\n", "import re_module.bq_functions as regulome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## User-defined Parameters\n", "The parameters for this experiment are the cancer type, the name of the clinical feature, the name of the gene whose mutations are extracted, and the minimum number of participants required for a category to be considered. The clinical feature must be categorical. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cancer_type = 'BRCA'\n", "clinical_name = 'histological_type'\n", "mutation_name = 'CDH1'\n", "MinSampleSize = 10\n", "\n", "bqclient = bigquery.Client()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data from BigQuery tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clinical data from BigQuery. The following query string retrieves clinical data from the 'pancancer-atlas.Filtered.clinical_PANCAN_patient_with_followup_filtered' table, available in the pancancer-atlas dataset. 
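
The query strings defined in the following cells are named subquery fragments of the form `name AS ( SELECT ... )` rather than complete statements; later cells join them with commas after a single `WITH` keyword and append a final `SELECT`. A minimal sketch of that assembly pattern, assuming two hypothetical placeholder fragments (`fragment_a`, `fragment_b`) and a hypothetical helper `assemble_query`, none of which are part of this notebook or of any library:

```python
# Hypothetical placeholder fragments; each has the shape 'name AS ( SELECT ... )'.
fragment_a = '''table_a AS (
  SELECT 1 AS id, 'x' AS label
)
'''

fragment_b = '''table_b AS (
  SELECT id, label FROM table_a
)
'''


def assemble_query(fragments, final_select):
    '''Join named subquery fragments into one statement with a single WITH clause.'''
    # Fragments are separated by commas; a later fragment may reference an earlier one.
    body = ', '.join(fragment.strip() for fragment in fragments)
    return 'WITH ' + body + ' ' + final_select


sql = assemble_query([fragment_a, fragment_b], 'SELECT * FROM table_b')
print(sql)
```

Keeping each step as its own fragment means an intermediate table can be inspected by changing only the final `SELECT`.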
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "query_table1 = \"\"\"table1 AS (\n", "SELECT\n", " symbol,\n", " avgdata AS data,\n", " ParticipantBarcode\n", "FROM (\n", " SELECT\n", " '{0}' AS symbol, \n", " {0} AS avgdata,\n", " bcr_patient_barcode AS ParticipantBarcode\n", " FROM `pancancer-atlas.Filtered.clinical_PANCAN_patient_with_followup_filtered`\n", " WHERE acronym = '{1}' AND {0} IS NOT NULL \n", " )\n", ")\n", "\"\"\".format(clinical_name, cancer_type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Somatic mutation data from Bigquery table. The following string query will retrieve a table with patients with at least one Somatic mutation in the user defined gene ('mutation_name'). This information is extracted from the 'pancancer-atlas.Filtered.MC3_MAF_V5_one_per_tumor_sample' table, available in pancancer-atlas dataset. Notice that we only use samples in which FILTER = 'PASS'. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "query_table2 = \"\"\"table2 AS (\n", "SELECT\n", " symbol,\n", " ParticipantBarcode\n", "FROM (\n", " SELECT\n", " Hugo_Symbol AS symbol, \n", " ParticipantBarcode AS ParticipantBarcode\n", " FROM `pancancer-atlas.Filtered.MC3_MAF_V5_one_per_tumor_sample`\n", " WHERE Study = '{1}' AND Hugo_Symbol = '{0}'\n", " AND FILTER = 'PASS' \n", " GROUP BY\n", " ParticipantBarcode, symbol\n", " )\n", ")\n", "\"\"\".format( mutation_name , cancer_type )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following query combines the two tables based on Participant barcodes. Data of participants for which one feature is missing are not being used. Nij is the number of participants for each pair of categories. 
data1 is the categorical data for the clinical feature specified by the user, and data2 is binary: 'YES' for participants with a mutation in the user-specified gene and 'NO' otherwise. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "query_summarize = \"\"\"summ_table AS (\n", "SELECT \n", " n1.data as data1,\n", " IF( n2.ParticipantBarcode is null, 'NO', 'YES') as data2,\n", " COUNT(*) as Nij\n", "FROM\n", " table1 AS n1\n", "LEFT JOIN\n", " table2 AS n2\n", "ON\n", " n1.ParticipantBarcode = n2.ParticipantBarcode\n", "GROUP BY\n", " data1, data2\n", ") \n", "\"\"\".format(str(MinSampleSize) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point we can take a look at the output table, where the column **Nij** is the number of participants for each pair of categorical values." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " in runQuery ... \n", " the results for this query were previously cached \n" ] }, { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>data1</th><th>data2</th><th>Nij</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Infiltrating Carcinoma NOS</td><td>NO</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>Infiltrating Ductal Carcinoma</td><td>NO</td><td>768</td></tr>\n",
"    <tr><th>2</th><td>Infiltrating Ductal Carcinoma</td><td>YES</td><td>9</td></tr>\n",
"    <tr><th>3</th><td>Infiltrating Lobular Carcinoma</td><td>YES</td><td>83</td></tr>\n",
"    <tr><th>4</th><td>Infiltrating Lobular Carcinoma</td><td>NO</td><td>118</td></tr>\n",
"    <tr><th>5</th><td>Medullary Carcinoma</td><td>NO</td><td>5</td></tr>\n",
"    <tr><th>6</th><td>Medullary Carcinoma</td><td>YES</td><td>1</td></tr>\n",
"    <tr><th>7</th><td>Metaplastic Carcinoma</td><td>NO</td><td>8</td></tr>\n",
"    <tr><th>8</th><td>Mixed Histology (please specify)</td><td>NO</td><td>26</td></tr>\n",
"    <tr><th>9</th><td>Mixed Histology (please specify)</td><td>YES</td><td>4</td></tr>\n",
"    <tr><th>10</th><td>Mucinous Carcinoma</td><td>NO</td><td>17</td></tr>\n",
"    <tr><th>11</th><td>Other specify</td><td>YES</td><td>3</td></tr>\n",
"    <tr><th>12</th><td>Other specify</td><td>NO</td><td>43</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " data1 data2 Nij\n", "0 Infiltrating Carcinoma NOS NO 1\n", "1 Infiltrating Ductal Carcinoma NO 768\n", "2 Infiltrating Ductal Carcinoma YES 9\n", "3 Infiltrating Lobular Carcinoma YES 83\n", "4 Infiltrating Lobular Carcinoma NO 118\n", "5 Medullary Carcinoma NO 5\n", "6 Medullary Carcinoma YES 1\n", "7 Metaplastic Carcinoma NO 8\n", "8 Mixed Histology (please specify) NO 26\n", "9 Mixed Histology (please specify) YES 4\n", "10 Mucinous Carcinoma NO 17\n", "11 Other specify YES 3\n", "12 Other specify NO 43" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sql = ( 'WITH\\n' + query_table1 + ',' + query_table2 + ',' + query_summarize +\n", "\"\"\"SELECT * FROM summ_table \n", " ORDER BY data1\n", "\"\"\")\n", "\n", "df_results = regulome.runQuery ( bqclient, sql, [] , dryRun=False )\n", "df_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table shows that **data2** (gene mutations) has two categories and **data1** (clinical feature) in this case has 8 categories. We can use Python to visualize the populations in each category. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/borisaguilar/anaconda3/lib/python3.7/site-packages/matplotlib/tight_layout.py:176: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations. \n", " warnings.warn('Tight layout not applied. The left and right margins '\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df_results.rename(columns={ \"data1\": clinical_name, \"data2\": mutation_name }, inplace=True)\n", "sns.catplot(y=clinical_name, x=\"Nij\",hue=mutation_name,data=df_results, kind=\"bar\",height=4, aspect=.7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute the statistics \n", "After summarizing the data in the table above, we are in a position to perform the Chi-square test. Before describing the test we need the following definitions:\n", "\n", "- **Nij** : Number of participants for each pair of categories\n", "- **N** : Total number of participants ( N = sum_ij Nij )\n", "- **Ni** : The total number of participants with category of data1 (clinical data) equal to i\n", "- **Nj** : The total number of participants with category of data2 (somatic mutation) equal to j\n", "- **E_nij** : Expected number of participants for each pair of categories under the null hypothesis. E_nij = (Ni\\*Nj)/N\n", "- **I, J** : The number of categories in data1 and data2 respectively\n", "\n", "The implementation of the Chi-square test consists of two steps:\n", "\n", "1 ) Generate the contingency table (see https://en.wikipedia.org/wiki/Contingency_table ), which in our case is a table with the values of **Nij** and **E_nij** for each pair of categorical values.\n", "\n", "2 ) Compute the Chi-square value as :\n", " $$\\chi^2 = \\sum_{i=1}^{I}\\sum_{j=1}^{J}\\frac{ (N_{ij} - E[n_{ij}] )^2 }{E[n_{ij}]}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To generate the contingency table we first use CROSS JOIN to form a table with all possible pairs between the two categorical features. Only categories with more than 'MinSampleSize' participants are considered; a threshold of 5 is typically used for the Chi-square test. 
The following string performs that operation:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " in runQuery ... \n", " the results for this query were previously cached \n" ] }, { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>data1</th><th>data2</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Infiltrating Lobular Carcinoma</td><td>YES</td></tr>\n",
"    <tr><th>1</th><td>Infiltrating Lobular Carcinoma</td><td>NO</td></tr>\n",
"    <tr><th>2</th><td>Other specify</td><td>YES</td></tr>\n",
"    <tr><th>3</th><td>Other specify</td><td>NO</td></tr>\n",
"    <tr><th>4</th><td>Infiltrating Ductal Carcinoma</td><td>YES</td></tr>\n",
"    <tr><th>5</th><td>Infiltrating Ductal Carcinoma</td><td>NO</td></tr>\n",
"    <tr><th>6</th><td>Mixed Histology (please specify)</td><td>YES</td></tr>\n",
"    <tr><th>7</th><td>Mixed Histology (please specify)</td><td>NO</td></tr>\n",
"    <tr><th>8</th><td>Mucinous Carcinoma</td><td>YES</td></tr>\n",
"    <tr><th>9</th><td>Mucinous Carcinoma</td><td>NO</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " data1 data2\n", "0 Infiltrating Lobular Carcinoma YES\n", "1 Infiltrating Lobular Carcinoma NO\n", "2 Other specify YES\n", "3 Other specify NO\n", "4 Infiltrating Ductal Carcinoma YES\n", "5 Infiltrating Ductal Carcinoma NO\n", "6 Mixed Histology (please specify) YES\n", "7 Mixed Histology (please specify) NO\n", "8 Mucinous Carcinoma YES\n", "9 Mucinous Carcinoma NO" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_expected= \"\"\"expected_table AS (\n", "SELECT data1, data2\n", "FROM ( \n", " SELECT data1, SUM(Nij) as Ni \n", " FROM summ_table\n", " GROUP BY data1 ) \n", "CROSS JOIN ( \n", " SELECT data2, SUM(Nij) as Nj\n", " FROM summ_table\n", " GROUP BY data2 )\n", " \n", "WHERE Ni > {0} AND Nj > {0}\n", ")\n", "\"\"\".format(str(MinSampleSize) )\n", "\n", "sql = ( 'WITH\\n' + query_table1 + ',' + query_table2 + ',' + query_summarize + ',' + query_expected +\n", "\"\"\"SELECT * FROM expected_table \n", "\"\"\") \n", "\n", "regulome.runQuery ( bqclient, sql, [] , dryRun=False )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the resulting table has $I * J$ rows. Next, the contingency table is generated by using a \"LEFT JOIN\" from this table of pairs onto the summary table and filling the missing values of **Nij** with zeros (the IF statement in the query below)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " in runQuery ... \n", " the results for this query were previously cached \n" ] }, { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>data1</th><th>data2</th><th>Nij</th><th>E_nij</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Infiltrating Ductal Carcinoma</td><td>NO</td><td>768</td><td>705.176471</td></tr>\n",
"    <tr><th>1</th><td>Infiltrating Lobular Carcinoma</td><td>NO</td><td>118</td><td>182.420168</td></tr>\n",
"    <tr><th>2</th><td>Mixed Histology (please specify)</td><td>NO</td><td>26</td><td>27.226891</td></tr>\n",
"    <tr><th>3</th><td>Mucinous Carcinoma</td><td>NO</td><td>17</td><td>15.428571</td></tr>\n",
"    <tr><th>4</th><td>Other specify</td><td>NO</td><td>43</td><td>41.747899</td></tr>\n",
"    <tr><th>5</th><td>Infiltrating Ductal Carcinoma</td><td>YES</td><td>9</td><td>71.823529</td></tr>\n",
"    <tr><th>6</th><td>Infiltrating Lobular Carcinoma</td><td>YES</td><td>83</td><td>18.579832</td></tr>\n",
"    <tr><th>7</th><td>Mixed Histology (please specify)</td><td>YES</td><td>4</td><td>2.773109</td></tr>\n",
"    <tr><th>8</th><td>Mucinous Carcinoma</td><td>YES</td><td>0</td><td>1.571429</td></tr>\n",
"    <tr><th>9</th><td>Other specify</td><td>YES</td><td>3</td><td>4.252101</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " data1 data2 Nij E_nij\n", "0 Infiltrating Ductal Carcinoma NO 768 705.176471\n", "1 Infiltrating Lobular Carcinoma NO 118 182.420168\n", "2 Mixed Histology (please specify) NO 26 27.226891\n", "3 Mucinous Carcinoma NO 17 15.428571\n", "4 Other specify NO 43 41.747899\n", "5 Infiltrating Ductal Carcinoma YES 9 71.823529\n", "6 Infiltrating Lobular Carcinoma YES 83 18.579832\n", "7 Mixed Histology (please specify) YES 4 2.773109\n", "8 Mucinous Carcinoma YES 0 1.571429\n", "9 Other specify YES 3 4.252101" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_contingency = \"\"\"contingency_table AS (\n", "SELECT\n", " T1.data1,\n", " T1.data2,\n", " IF( Nij IS NULL, 0, Nij) as Nij,\n", " (SUM(Nij) OVER (PARTITION BY T1.data1))*(SUM(Nij) OVER (PARTITION BY T1.data2))/ SUM(Nij) OVER () AS E_nij\n", " \n", "FROM\n", " expected_table AS T1\n", "LEFT JOIN\n", " summ_table AS T2\n", "ON \n", " T1.data1 = T2.data1 AND T1.data2 = T2.data2\n", ")\n", "\"\"\"\n", "\n", "sql = ( 'WITH\\n' + query_table1 + ',' + query_table2 + ',' + query_summarize + ',' + query_expected + ',' + query_contingency +\n", "\"\"\"SELECT * FROM contingency_table\n", " ORDER BY data2, data1\n", "\"\"\") \n", "\n", "df_contingency = regulome.runQuery ( bqclient, sql, [] , dryRun=False )\n", "df_contingency" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this contingency table, we can use Python to compute the Chi-square statistic. 
This serves to validate our BigQuery implementation:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Chi2, p, dof : 309.39167298726557 1.020302231598449e-65 4\n", "Expected nij : \n", "[[705.17647059 182.42016807 27.22689076 15.42857143 41.74789916]\n", " [ 71.82352941 18.57983193 2.77310924 1.57142857 4.25210084]]\n" ] } ], "source": [ "# Rows of the contingency table: mutation status NO first, then YES\n", "no_a = df_contingency[ df_contingency['data2'] == 'NO' ]['Nij'].values\n", "yes_a = df_contingency[ df_contingency['data2'] == 'YES' ]['Nij'].values \n", "conting_table = [no_a , yes_a]\n", "\n", "chi2, p, dof, expected_nij = stats.chi2_contingency( conting_table ) \n", "print( \"Chi2, p, dof : \", chi2, p , dof)\n", "print( \"Expected nij : \")\n", "print(expected_nij)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following query string computes the Chi-square and the Cramer's V statistics from the contingency table." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " in runQuery ... \n", " the results for this query were previously cached \n" ] }, { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>I</th><th>J</th><th>N</th><th>Chi2</th><th>V</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>5</td><td>2</td><td>1071</td><td>309.391673</td><td>0.537477</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " I J N Chi2 V\n", "0 5 2 1071 309.391673 0.537477" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sql = ( 'WITH\\n' + query_table1 + ',' + query_table2 + ',' + query_summarize + ',' + query_expected + ',' + query_contingency +\n", "\"\"\"\n", "SELECT I, J, N, Chi2,\n", " IF(I > J, SQRT( Chi2 /(N*(J-1))),SQRT(Chi2/(N*(I-1)) ) ) as V\n", "FROM (\n", " SELECT\n", " COUNT( DISTINCT data1 ) as I,\n", " COUNT( DISTINCT data2 ) as J,\n", " SUM(Nij) as N,\n", " SUM( (Nij - E_nij)*(Nij - E_nij) / E_nij ) as Chi2 \n", " FROM contingency_table\n", " )\n", "\"\"\") \n", "\n", "df_chi = regulome.runQuery ( bqclient, sql, [] , dryRun=False )\n", "df_chi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Chi2 value computed with BigQuery can be compared to the one obtained with Python; in this example both are approximately 309.39. A large value of Chi2 indicates that the observed counts are unlikely under the null hypothesis (that data1 and data2 have no association). The degrees of freedom, $(I-1)(J-1) = IJ-I-J+1$, and the Chi2 value are needed to compute the p-value under the null hypothesis. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 2 }