{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Ona5p1IgVCrb"
   },
   "source": [
    "# ISB-CGC Community Notebooks\n",
    "\n",
    "Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n",
    "\n",
    "```\n",
    "Title:   How to perform complex joins\n",
    "Author:  David L Gibbs\n",
    "Created: 2020-04-21\n",
    "Purpose: Basic overview of complex joins in BigQuery\n",
    "URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_perform_complex_joins.ipynb\n",
    "Notes:   This covers joining multiple tables based on a sample identifier.\n",
    "```\n",
    "***\n",
    "\n",
    "This notebook will show you how to perform a complex join using the publicly available TCGA BigQuery tables that the [ISB-CGC](http://isb-cgc.org) project has produced based on the open-access [TCGA](http://cancergenome.nih.gov/) data available at the [Data Portal](https://portal.gdc.cancer.gov/).  You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery.  If you don't already have one, you can sign up for a [free-trial](https://cloud.google.com/free-trial/). You can also explore the available tables and data sets before commiting to creating a GCP project though the [ISB-CGC BigQuery Table Searcher](isb-cgc.appspot.com/bq_meta_search/).\n",
    "\n",
    "We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists.  Here are some links to some resources that you might find useful: \n",
    "* [BigQuery](https://cloud.google.com/bigquery/what-is-bigquery)\n",
    "* the BigQuery [web UI](https://console.cloud.google.com/bigquery) \n",
    "    * where you can run queries interactively\n",
    "* [Jupyter Notebooks](http://jupyter.org/)\n",
    "* [Google Cloud Datalab](https://cloud.google.com/datalab/) \n",
    "    * interactive cloud-based platform for analyzing data built on the Jupyter Notebooks\n",
    "* [Google Colaboratory](https://colab.research.google.com/)\n",
    "    * Free Jupyter Notebook environment that runs in your browser\n",
    "\n",
    "There are also many tutorials and samples available on github (see, in particular, the [datalab](https://github.com/GoogleCloudPlatform/datalab) repo, the [Google Genomics](  https://github.com/googlegenomics) project), and our own [Community Notebooks](https://github.com/isb-cgc/Community-Notebooks).\n",
    "\n",
    "OK then, let's get started!  In order to work with BigQuery, the first thing you need to do is import the bigquery module:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install google"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip3 install google.cloud.bigquery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "MFLeeww_VCrh"
   },
   "outputs": [],
   "source": [
    "import google.cloud.bigquery as bigquery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "64nmsYB1VMta"
   },
   "source": [
    "Next, we need to Authorize ourselves. For more information see ['Quick Start Guide to ISB-CGC'](https://nbviewer.jupyter.org/github/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb) and alternative authentication methods can be found [here](https://googleapis.github.io/google-cloud-python/latest/core/auth.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 360
    },
    "colab_type": "code",
    "id": "QgN9GwOZ2ttz",
    "outputId": "89af1b8a-f1a2-43d6-99c6-d6b99044f9c4"
   },
   "outputs": [],
   "source": [
    "!gcloud auth application-default login\n",
    "# If you're in a notebook, this might not allow for a reponse to be entered.\n",
    "# in that case, in Jupyter Lab you can open a terminal and do it there. (same command)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "DNqD9CkHb34J"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Please update the project number with your Google Cloud Project\n"
     ]
    }
   ],
   "source": [
    "# Create a variable for which client to use with BigQuery\n",
    "project_num = 'your_project_number' # Update with your Google Project number\n",
    "\n",
    "if project_num == 'your_project_number':\n",
    "    print('Please update the project number with your Google Cloud Project')\n",
    "else:\n",
    "    client = bigquery.Client(project_num) # Replace your_project_number with your project ID"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext google.cloud.bigquery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "I4qZY5-6VCrl"
   },
   "source": [
    "BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project.  The tables we will be working with in this notebook are found in datasets  **`TCGA_bioclin_v0`**, **`platform_reference`**, and **`TCGA_hg19_data_v0`, all owned by the **`isb-cgc`** project.  A full table identifier is of the form `<project_id>.<dataset_id>.<table_id>`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "MzG3wtPPVCrq"
   },
   "source": [
    "## Join Join Join Join\n",
    "\n",
    "Our strategy is going to based on linking up a series of subtables, rather than trying to do all the joins in one go. It's a little like rolling up a snowball.\n",
    "\n",
    "Let's start by looking at the clinical data table, and we'll build up the query, table by table."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "q4tK5sIO5DZy"
   },
   "source": [
    "#### Starting at the top\n",
    "Notice that in the 'group by', we use integers that reference the selected terms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "colab_type": "code",
    "id": "L9QCMBJLVCsA",
    "outputId": "775d9b91-8d2c-4cdf-cb33-e0efa35c2619"
   },
   "outputs": [],
   "source": [
    "%%bigquery --project project_num \n",
    "\n",
    "WITH\n",
    "clinTab AS (\n",
    "    select\n",
    "      project_short_name, case_barcode, age_at_diagnosis, pathologic_stage\n",
    "    from\n",
    "      `isb-cgc.TCGA_bioclin_v0.Clinical` as Clin\n",
    "    where\n",
    "      project_short_name = \"TCGA-BRCA\" AND\n",
    "      Clin.age_at_diagnosis < 50 \n",
    "    group by \n",
    "      1,2,3,4\n",
    ")\n",
    "\n",
    "select * from clinTab limit 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### First Join\n",
    "Here, each row of the clinical table is going to get joined to *each* protein-and-barcode combo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery pdf1 --project project_num\n",
    "\n",
    "WITH\n",
    "clinTab AS (\n",
    "    select\n",
    "      project_short_name, case_barcode, age_at_diagnosis, pathologic_stage\n",
    "    from\n",
    "      `isb-cgc.TCGA_bioclin_v0.Clinical` as Clin\n",
    "    where\n",
    "      project_short_name = \"TCGA-BRCA\" AND\n",
    "      Clin.age_at_diagnosis < 50 \n",
    "    group by \n",
    "      1,2,3,4\n",
    "),\n",
    "protTab AS (\n",
    "    select\n",
    "         Prot.case_barcode, Prot.project_short_name, Prot.protein_expression, \n",
    "         Prot.gene_name as gene_name, Prot.protein_name,\n",
    "         clinTab.age_at_diagnosis, clinTab.pathologic_stage\n",
    "    from\n",
    "        `isb-cgc.TCGA_hg19_data_v0.Protein_Expression` Prot\n",
    "    join\n",
    "         clinTab\n",
    "    on clinTab.case_barcode = Prot.case_barcode AND clinTab.project_short_name = Prot.project_short_name\n",
    "    where\n",
    "        regexp_contains(Prot.gene_name, '^P')\n",
    "    group by\n",
    "          1,2,3,4,5,6,7\n",
    ")\n",
    "\n",
    "select * from protTab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf1.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Second Join\n",
    "Now we'll bring in variant data.  Notice in the 'mutTab', the features that were originally from the clinical table, are now part of Prot. We are also going to join only deletions which are called 'DEL's."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery mdf1 --project project_num \n",
    "\n",
    "WITH\n",
    "clinTab AS (\n",
    "    select\n",
    "        project_short_name, case_barcode, age_at_diagnosis, pathologic_stage\n",
    "    from\n",
    "        `isb-cgc.TCGA_bioclin_v0.Clinical` as Clin\n",
    "    where\n",
    "        project_short_name = \"TCGA-BRCA\" AND\n",
    "        Clin.age_at_diagnosis < 50 \n",
    "    group by \n",
    "        1,2,3,4\n",
    "),\n",
    "\n",
    "protTab AS (\n",
    "    select\n",
    "         clinTab.age_at_diagnosis, clinTab.pathologic_stage,\n",
    "         Prot.case_barcode, Prot.project_short_name, Prot.protein_expression, \n",
    "         Prot.gene_name as gene_name, Prot.protein_name\n",
    "    from\n",
    "        `isb-cgc.TCGA_hg19_data_v0.Protein_Expression` Prot\n",
    "    join\n",
    "         clinTab\n",
    "    on clinTab.case_barcode = Prot.case_barcode AND clinTab.project_short_name = Prot.project_short_name\n",
    "    where\n",
    "        regexp_contains(Prot.gene_name, '^P')\n",
    "    group by\n",
    "          1,2,3,4,5,6,7\n",
    "),\n",
    "\n",
    "mutTab AS (\n",
    "    select\n",
    "        prot.age_at_diagnosis, prot.pathologic_stage,\n",
    "        Prot.case_barcode, Prot.project_short_name, Prot.protein_expression, \n",
    "        Prot.gene_name as gene_name, Prot.protein_name,\n",
    "        mut.Variant_Type, mut.Genome_Change, mut.Mutation_Status \n",
    "    from\n",
    "        `isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_DCC` mut\n",
    "    join\n",
    "        protTab as prot\n",
    "    on prot.case_barcode = mut.case_barcode\n",
    "        and prot.project_short_name = mut.project_short_name\n",
    "        and prot.gene_name = mut.Hugo_Symbol\n",
    "    where\n",
    "        mut.Variant_Type = 'DEL'\n",
    "    group by 1,2,3,4,5,6,7,8,9,10\n",
    ")\n",
    "\n",
    "select * from mutTab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mdf1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mdf1.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Third Join -- detour\n",
    "Now to bring in the methylation data.  This data, from Illumina methylation arrays, the Human Methylation 27k and 450k platforms, is indexed by probe IDs. That means we first need to do ane extra join to bring in gene symbols.  Also, observe that the methylation annotation is a nested table. In the table schema, you will notice the UCSC field is a record with three components, RefGene_Name, RefGene_Group, and RefGene_Accession. To make our work easier, we want to take this nested table and make it a 'flat' table. For some documentation on doing that see [this doc](https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery prdf1 --project project_num \n",
    "\n",
    "SELECT\n",
    "    IlmnID,\n",
    "    RefGene_Name,\n",
    "    RefGene_Group,\n",
    "    Infinium_Design_Type\n",
    "FROM\n",
    "    `isb-cgc.platform_reference.methylation_annotation`,\n",
    "UNNEST(UCSC) \n",
    "limit 10\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prdf1.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we made our annotation table flat, we can join it in using the regular method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery ardf1 --project project_num \n",
    "\n",
    "WITH\n",
    "probes AS (\n",
    "  SELECT\n",
    "    IlmnID,\n",
    "    RefGene_Name,\n",
    "    RefGene_Group,\n",
    "    Infinium_Design_Type\n",
    "  FROM\n",
    "    `isb-cgc.platform_reference.methylation_annotation`,\n",
    "    UNNEST(UCSC) ),\n",
    "   \n",
    "methAnnot as (\n",
    "select\n",
    "  Methyl.probe_id,Methyl.beta_value,Methyl.platform,\n",
    "  Methyl.case_barcode, Methyl.project_short_name,\n",
    "  probes.RefGene_Name, probes.RefGene_Group, probes.Infinium_Design_Type\n",
    "from\n",
    "  (select * from `isb-cgc.TCGA_hg19_data_v0.DNA_Methylation_chr22` limit 100) Methyl  -- SUB-TABLE HERE!\n",
    "join\n",
    "  probes\n",
    "on probes.IlmnID = Methyl.probe_id\n",
    "where\n",
    "    project_short_name = \"TCGA-BRCA\"\n",
    ")\n",
    "    \n",
    "select * from methAnnot limit 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ardf1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ardf1.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Third Join -- making the turn\n",
    "Now that we've mapped probe ids to gene symbols, we'll bring in methylation data.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery bigdf --project project_num \n",
    "\n",
    "WITH\n",
    "clinTab AS (\n",
    "    select\n",
    "        project_short_name, case_barcode, age_at_diagnosis, pathologic_stage\n",
    "    from\n",
    "        `isb-cgc.TCGA_bioclin_v0.Clinical` as Clin\n",
    "    where\n",
    "        project_short_name = \"TCGA-BRCA\" AND\n",
    "        Clin.age_at_diagnosis < 50 \n",
    "    group by \n",
    "        1,2,3,4\n",
    "),\n",
    "\n",
    "protTab AS (\n",
    "    select\n",
    "         clinTab.age_at_diagnosis, clinTab.pathologic_stage,\n",
    "         Prot.case_barcode, Prot.project_short_name, Prot.protein_expression, \n",
    "         Prot.gene_name as gene_name, Prot.protein_name\n",
    "    from\n",
    "        `isb-cgc.TCGA_hg19_data_v0.Protein_Expression` Prot\n",
    "    join\n",
    "         clinTab\n",
    "    on clinTab.case_barcode = Prot.case_barcode AND clinTab.project_short_name = Prot.project_short_name\n",
    "    where\n",
    "        regexp_contains(Prot.gene_name, '^P')\n",
    "    group by\n",
    "          1,2,3,4,5,6,7\n",
    "),\n",
    "\n",
    "mutTab AS (\n",
    "    select\n",
    "        prot.age_at_diagnosis, prot.pathologic_stage,\n",
    "        Prot.case_barcode, Prot.project_short_name, Prot.protein_expression, \n",
    "        Prot.gene_name as gene_name, Prot.protein_name,\n",
    "        mut.Variant_Type, mut.Genome_Change, mut.Mutation_Status \n",
    "    from\n",
    "        `isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_DCC` mut\n",
    "    join\n",
    "        protTab as prot\n",
    "    on prot.case_barcode = mut.case_barcode\n",
    "        and prot.project_short_name = mut.project_short_name\n",
    "        and prot.gene_name = mut.Hugo_Symbol\n",
    "    where\n",
    "        mut.Variant_Type = 'DEL'\n",
    "    group by 1,2,3,4,5,6,7,8,9,10\n",
    "),\n",
    "\n",
    "probes AS (\n",
    "  SELECT\n",
    "    IlmnID,\n",
    "    RefGene_Name,\n",
    "    RefGene_Group,\n",
    "    Infinium_Design_Type\n",
    "  FROM\n",
    "    `isb-cgc.platform_reference.methylation_annotation`,\n",
    "    UNNEST(UCSC) \n",
    "),\n",
    "   \n",
    "methAnnot as (\n",
    "select\n",
    "  Methyl.probe_id,Methyl.beta_value,Methyl.platform,\n",
    "  Methyl.case_barcode, Methyl.project_short_name,\n",
    "  probes.RefGene_Name, probes.RefGene_Group, probes.Infinium_Design_Type\n",
    "from\n",
    "  `isb-cgc.TCGA_hg19_data_v0.DNA_Methylation_chr22` Methyl  \n",
    "join\n",
    "  probes\n",
    "on probes.IlmnID = Methyl.probe_id\n",
    "where\n",
    "    project_short_name = \"TCGA-BRCA\"\n",
    "),\n",
    "    \n",
    "methTab as (\n",
    "select\n",
    "    mutTab.age_at_diagnosis, mutTab.pathologic_stage,\n",
    "    mutTab.case_barcode, mutTab.project_short_name, mutTab.protein_expression, \n",
    "    mutTab.gene_name as gene_name, mutTab.protein_name,\n",
    "    mutTab.Variant_Type, mutTab.Genome_Change, mutTab.Mutation_Status,\n",
    "    Methyl.probe_id, Methyl.beta_value, Methyl.platform, \n",
    "    Methyl.RefGene_Group, Methyl.Infinium_Design_Type\n",
    "from\n",
    "    methAnnot Methyl\n",
    "join\n",
    "    mutTab\n",
    "on\n",
    "    mutTab.case_barcode = Methyl.case_barcode\n",
    "    and mutTab.project_short_name = Methyl.project_short_name\n",
    "    and mutTab.gene_name = Methyl.RefGene_Name\n",
    "group by\n",
    "  1,2,3,4,5,6,7,8,9,10,11,12,13,14,15\n",
    ")\n",
    "\n",
    "select * from mutTab "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bigdf.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bigdf.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### That's it!\n",
    "\n",
    "We've created a big wide table that's the result of joining 5 separate tables. Whew!  Send us an email, how did it go?  \n",
    "feedback@isb-cgc.org"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "How_to_create_cohorts.ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}