{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "vLscn83vXHli"
   },
   "source": [
    "# ISB-CGC Community Notebooks\n",
    "\n",
    "Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n",
    "\n",
    "```\n",
    "Title:   One-Way ANOVA Test in BigQuery\n",
    "Author:  Lauren Hagen\n",
    "Created: 2019-08-02\n",
    "Purpose: Demonstrate an ANOVA test within BigQuery\n",
    "URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_perform_an_ANOVA_test_in_BigQuery.ipynb\n",
    "Notes:   This notebook was adapted from work by David L Gibbs, September 2017 Query of the Month\n",
    "```\n",
    "***\n",
    "\n",
    "## Running ANOVA tests in BigQuery\n",
    "\n",
    "In this notebook, we will cover how to do a one-way ANOVA test in BigQuery. This statistical test can be used to determine whether there is a statistically significant difference between the means of two or more independent groups. Although in this example, I’m only looking at two groups, it would not be difficult to extend this to any number of groups, assuming there is a reasonable number of samples within each group.\n",
    "\n",
    "Consider the model y<sub>ij</sub> = m + a<sub>i</sub> + e<sub>ij</sub>, where y<sub>ij</sub> is a continuous variable over samples j, in groups i, and a<sub>i</sub> is a constant for each group i, and e<sub>ij</sub> is a gaussian error term with mean 0.\n",
    "\n",
    "Using this model, we are describing the data as being sampled from groups, with each group having a mean value equal to m + a<sub>i</sub>. The null hypothesis is that each of the group means is the same (ie that the ai terms are zero), while the alternative hypothesis is that at least one of the ai terms is not zero.\n",
    "\n",
    "We use the F-test to compare these two hypotheses. To compute the test statistic, we compute the within-group variation and the between-group variation. Recall that sample variance is defined as the sum of squared differences between observations and the mean, divided by the number of samples (normalized).\n",
    "\n",
    "First, we will need to import bigquery module, authenticate ourselves, and create a client variable. For more information see ['Quick Start Guide to ISB-CGC'](https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb) and alternative authentication methods can be found [here](https://googleapis.github.io/google-cloud-python/latest/core/auth.html).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "fZArtDc-6wfv"
   },
   "outputs": [],
   "source": [
    "# Import BigQuery Module\n",
    "from google.cloud import bigquery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 360
    },
    "colab_type": "code",
    "id": "iIwtOPNx6y0P",
    "outputId": "0adb358e-d358-43b6-a433-ddb8123a762e"
   },
   "outputs": [],
   "source": [
    "# Autheticate ourselves\n",
    "!gcloud auth application-default login"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "7LcFK-VD7qde"
   },
   "outputs": [],
   "source": [
    "# Create a variable for which client to use with BigQuery\n",
    "project = 'your_project_number' # Replace with your project ID\n",
    "if project_num == 'your_project_number':\n",
    "    print('Please update the project number with your Google Cloud Project')\n",
    "else:\n",
    "    client = bigquery.Client(project) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ihpge8HA6qk6"
   },
   "source": [
    "Let’s look at the query:\n",
    "** Note: you will need to update 'your_project_number' with your project number before continuing with the notebook **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "lojCKMV4513d"
   },
   "outputs": [],
   "source": [
    "# Magic command of bigquery with the project id as isb-cgc-02-0001 and create a Pandas Dataframe\n",
    "# Change isb-cgc-02-0001 to your project ID\n",
    "%%bigquery anova --project your_project_number\n",
    "WITH\n",
    "  -- using standard SQL,\n",
    "  -- we'll select our cohort and gene expression\n",
    "  --\n",
    "  cohortExpr AS (\n",
    "  SELECT\n",
    "    sample_barcode,\n",
    "    LOG10(normalized_count) AS expr\n",
    "  FROM\n",
    "    `isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM`\n",
    "  WHERE\n",
    "    project_short_name = 'TCGA-BRCA'\n",
    "    AND HGNC_gene_symbol = 'TP53'\n",
    "    AND normalized_count IS NOT NULL\n",
    "    AND normalized_count > 0),\n",
    "  --\n",
    "  -- And we'll select the variant data for our cohort,\n",
    "  -- we're going to be comparing variant types (SNP, DEL, etc)\n",
    "  --\n",
    "  cohortVar AS (\n",
    "  SELECT\n",
    "    Variant_Type,\n",
    "    sample_barcode_tumor AS sample_barcode\n",
    "  FROM\n",
    "    `isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_MC3`\n",
    "  WHERE\n",
    "    SYMBOL = 'TP53' ),\n",
    "  --\n",
    "  -- then we join the expression and variant data using sample barcodes\n",
    "  --\n",
    "  cohort AS (\n",
    "  SELECT\n",
    "    cohortExpr.sample_barcode AS sample_barcode,\n",
    "    Variant_Type AS group_name,\n",
    "    expr\n",
    "  FROM\n",
    "    cohortExpr\n",
    "  JOIN\n",
    "    cohortVar\n",
    "  ON\n",
    "    cohortExpr.sample_barcode = cohortVar.sample_barcode ),\n",
    "  --\n",
    "  -- First part of the calculation, the grand mean (over everything)\n",
    "  --\n",
    "  grandMeanTable AS (\n",
    "  SELECT\n",
    "    AVG(expr) AS grand_mean\n",
    "  FROM\n",
    "    cohort ),\n",
    "  --\n",
    "  -- Then we need a mean per group, and we can get a count of samples\n",
    "  -- per group.\n",
    "  --\n",
    "  groupMeansTable AS (\n",
    "  SELECT\n",
    "    AVG(expr) AS group_mean,\n",
    "    group_name,\n",
    "    COUNT(sample_barcode) AS n\n",
    "  FROM\n",
    "    cohort\n",
    "  GROUP BY\n",
    "    group_name),\n",
    "  --\n",
    "  -- To get the between-group variance\n",
    "  -- we take the difference between the grand mean\n",
    "  -- and the means for each group and sum over all samples\n",
    "  -- ... a short cut being taking the product with n.\n",
    "  -- Later we'll sum over the n_sq_diff\n",
    "  --\n",
    "  ssBetween AS (\n",
    "  SELECT\n",
    "    group_name,\n",
    "    group_mean,\n",
    "    grand_mean,\n",
    "    n,\n",
    "    n*POW(group_mean - grand_mean,2) AS n_diff_sq\n",
    "  FROM\n",
    "    groupMeansTable\n",
    "  CROSS JOIN\n",
    "    grandMeanTable ),\n",
    "  --\n",
    "  -- Then, to get the variance within each group\n",
    "  -- we have to build a table matching up the group mean\n",
    "  -- with the values for each group. So we join the group\n",
    "  -- means to the values on group name. We are going to\n",
    "  -- sum over this table just like ssBetween\n",
    "  --\n",
    "  ssWithin AS (\n",
    "  SELECT\n",
    "    a.group_name AS group_name,\n",
    "    expr,\n",
    "    group_mean,\n",
    "    b.n AS n,\n",
    "    POW(expr - group_mean, 2) AS s2\n",
    "  FROM\n",
    "    cohort a\n",
    "  JOIN\n",
    "    ssBetween b\n",
    "  ON\n",
    "    a.group_name = b.group_name ),\n",
    "  --\n",
    "  -- The F stat comes from a ratio, the numerator is\n",
    "  -- calculated using the between group variance, and\n",
    "  -- dividing by the number of groups (k) minus 1.\n",
    "  --\n",
    "  numerator AS (\n",
    "  SELECT\n",
    "    'dummy' AS dummy,\n",
    "    SUM(n_diff_sq) / (count(group_name) - 1) AS mean_sq_between\n",
    "  FROM\n",
    "    ssBetween ),\n",
    "  --\n",
    "  -- The denominator of the F stat ratio is found using the\n",
    "  -- variance within groups. We divide the sum of the within\n",
    "  -- group variance and divide it by (n-k).\n",
    "  --\n",
    "  denominator AS (\n",
    "  SELECT\n",
    "    'dummy' AS dummy,\n",
    "    COUNT(distinct(group_name)) AS k,\n",
    "    COUNT(group_name) AS n,\n",
    "    SUM(s2)/(COUNT(group_name)-COUNT(distinct(group_name))) AS mean_sq_within\n",
    "  FROM\n",
    "    ssWithin),\n",
    "  --\n",
    "  -- Now we're ready to calculate F!\n",
    "  --\n",
    "  Ftable AS (\n",
    "  SELECT\n",
    "    n,\n",
    "    k,\n",
    "    mean_sq_between,\n",
    "    mean_sq_within,\n",
    "    mean_sq_between / mean_sq_within AS F\n",
    "  FROM\n",
    "    numerator\n",
    "  JOIN\n",
    "    denominator\n",
    "  ON\n",
    "    numerator.dummy = denominator.dummy)\n",
    "\n",
    "SELECT\n",
    "  *\n",
    "FROM\n",
    "  Ftable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 80
    },
    "colab_type": "code",
    "id": "4UQE_Vt17kot",
    "outputId": "d8d25de7-d4c7-4865-e2bf-64c3381d4be1"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>n</th>\n",
       "      <th>k</th>\n",
       "      <th>mean_sq_between</th>\n",
       "      <th>mean_sq_within</th>\n",
       "      <th>F</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>375</td>\n",
       "      <td>3</td>\n",
       "      <td>6.230237</td>\n",
       "      <td>0.095634</td>\n",
       "      <td>65.146988</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     n  k  mean_sq_between  mean_sq_within          F\n",
       "0  375  3         6.230237        0.095634  65.146988"
      ]
     },
     "execution_count": 6,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "anova"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "fnM-nH70DXtR"
   },
   "source": [
    "OK, so let’s check our work. Using the BRCA cohort and TP53 as our gene, we have 375 samples with a variant in this gene. We’re going to look at whether the type of variant is related to the gene expression we observe. If we just pull down the data using the ‘cohort’ subtable (as above), we can get a small data frame, which let’s us do the standard F stat table.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "EmFemcovDXFC"
   },
   "outputs": [],
   "source": [
    "# Magic command of bigquery with the project id as isb-cgc-02-0001 and create a Pandas Dataframe\n",
    "# Change isb-cgc-02-0001 to your project ID\n",
    "%%bigquery dat --project your_project_number\n",
    "WITH\n",
    "  -- using standard SQL,\n",
    "  -- we'll select our cohort and gene expression\n",
    "  --\n",
    "  cohortExpr AS (\n",
    "  SELECT\n",
    "    sample_barcode,\n",
    "    LOG10(normalized_count) AS expr\n",
    "  FROM\n",
    "    `isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM`\n",
    "  WHERE\n",
    "    project_short_name = 'TCGA-BRCA'\n",
    "    AND HGNC_gene_symbol = 'TP53'\n",
    "    AND normalized_count IS NOT NULL\n",
    "    AND normalized_count > 0),\n",
    "  --\n",
    "  -- And we'll select the variant data for our cohort,\n",
    "  -- we're going to be comparing variant types (SNP, DEL, etc)\n",
    "  --\n",
    "  cohortVar AS (\n",
    "  SELECT\n",
    "    Variant_Type,\n",
    "    sample_barcode_tumor AS sample_barcode\n",
    "  FROM\n",
    "    `isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_MC3`\n",
    "  WHERE\n",
    "    SYMBOL = 'TP53' ),\n",
    "  --\n",
    "  -- then we join the expression and variant data using sample barcodes\n",
    "  --\n",
    "  cohort AS (\n",
    "  SELECT\n",
    "    cohortExpr.sample_barcode AS sample_barcode,\n",
    "    Variant_Type AS group_name,\n",
    "    expr\n",
    "  FROM\n",
    "    cohortExpr\n",
    "  JOIN\n",
    "    cohortVar\n",
    "  ON\n",
    "    cohortExpr.sample_barcode = cohortVar.sample_barcode )\n",
    "SELECT sample_barcode, group_name, expr\n",
    "FROM cohort"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "colab_type": "code",
    "id": "35eBL36IDnQq",
    "outputId": "39c2ad59-9cc3-4e7f-eb00-5be7dd8d715d"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sample_barcode</th>\n",
       "      <th>group_name</th>\n",
       "      <th>expr</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>TCGA-E2-A155-01A</td>\n",
       "      <td>SNP</td>\n",
       "      <td>2.982242</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>TCGA-AR-A0TV-01A</td>\n",
       "      <td>SNP</td>\n",
       "      <td>3.139173</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>TCGA-A2-A0CL-01A</td>\n",
       "      <td>SNP</td>\n",
       "      <td>3.340913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>TCGA-B6-A0IK-01A</td>\n",
       "      <td>SNP</td>\n",
       "      <td>3.470800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>TCGA-AN-A0FL-01A</td>\n",
       "      <td>SNP</td>\n",
       "      <td>2.312675</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     sample_barcode group_name      expr\n",
       "0  TCGA-E2-A155-01A        SNP  2.982242\n",
       "1  TCGA-AR-A0TV-01A        SNP  3.139173\n",
       "2  TCGA-A2-A0CL-01A        SNP  3.340913\n",
       "3  TCGA-B6-A0IK-01A        SNP  3.470800\n",
       "4  TCGA-AN-A0FL-01A        SNP  2.312675"
      ]
     },
     "execution_count": 8,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dat.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "mOAOB9yLFzZt"
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "colab_type": "code",
    "id": "7rGJRBWxDxFS",
    "outputId": "872d8ea4-505e-4163-d1fc-385f1c8257e3"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"8\" halign=\"left\">expr</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>min</th>\n",
       "      <th>25%</th>\n",
       "      <th>50%</th>\n",
       "      <th>75%</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>group_name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>DEL</th>\n",
       "      <td>57.0</td>\n",
       "      <td>2.791941</td>\n",
       "      <td>0.322067</td>\n",
       "      <td>2.298823</td>\n",
       "      <td>2.579250</td>\n",
       "      <td>2.743779</td>\n",
       "      <td>2.913681</td>\n",
       "      <td>3.551119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>INS</th>\n",
       "      <td>15.0</td>\n",
       "      <td>2.642215</td>\n",
       "      <td>0.115888</td>\n",
       "      <td>2.462272</td>\n",
       "      <td>2.546598</td>\n",
       "      <td>2.662163</td>\n",
       "      <td>2.719217</td>\n",
       "      <td>2.888580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SNP</th>\n",
       "      <td>303.0</td>\n",
       "      <td>3.218580</td>\n",
       "      <td>0.312959</td>\n",
       "      <td>2.312675</td>\n",
       "      <td>3.100479</td>\n",
       "      <td>3.274329</td>\n",
       "      <td>3.427654</td>\n",
       "      <td>3.796834</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             expr                      ...                              \n",
       "            count      mean       std  ...       50%       75%       max\n",
       "group_name                             ...                              \n",
       "DEL          57.0  2.791941  0.322067  ...  2.743779  2.913681  3.551119\n",
       "INS          15.0  2.642215  0.115888  ...  2.662163  2.719217  2.888580\n",
       "SNP         303.0  3.218580  0.312959  ...  3.274329  3.427654  3.796834\n",
       "\n",
       "[3 rows x 8 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dat.groupby(['group_name']).describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "fQMM_xnzG_VM"
   },
   "outputs": [],
   "source": [
    "from scipy import stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "wlhoX7HfHAVs",
    "outputId": "cf6b2b1b-6e26-4ce5-d2a2-c764b767ca3c"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "F_onewayResult(statistic=65.1469883103552, pvalue=5.5310088412358445e-25)"
      ]
     },
     "execution_count": 21,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dat_DEL = dat[dat['group_name']=='DEL']\n",
    "dat_INS = dat[dat['group_name']=='INS']\n",
    "dat_SNP = dat[dat['group_name']=='SNP']\n",
    "stats.f_oneway(dat_DEL['expr'],dat_INS['expr'],dat_SNP['expr'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "oGp6edrhOdjX"
   },
   "source": [
    "OK, if you run the above BigQuery, you’ll see the same results. We see that the F statistic is really high, which makes sense looking at the difference in mean expression values across the groups (these are log10 expression values)."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "ANOVA_test_in_BigQuery.ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}