{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The ISB-CGC open-access TCGA tables in BigQuery\n",
    "\n",
    "The goal of this notebook is to introduce you to a new publicly available, open-access dataset in BigQuery.  This set of BigQuery tables was produced by the [ISB-CGC](http://www.isb-cgc.org) project, based on the open-access [TCGA](http://cancergenome.nih.gov/) data available at the TCGA [Data Portal](https://tcga-data.nci.nih.gov/tcga/).  You will need access to a Google Cloud Platform (GCP) project in order to use BigQuery.  If you don't already have one, you can sign up for a [free trial](https://cloud.google.com/free-trial/) or contact [us](mailto:info@isb-cgc.org) and become part of the community evaluation phase of our Cancer Genomics Cloud pilot.  (You can find more information about this NCI-funded program [here](https://cbiit.nci.nih.gov/ncip/nci-cancer-genomics-cloud-pilots).)\n",
    "\n",
    "We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists.  Here are links to some resources that you might find useful: \n",
    "* [BigQuery](https://cloud.google.com/bigquery/what-is-bigquery), \n",
    "* the BigQuery [web UI](https://bigquery.cloud.google.com/) where you can run queries interactively, \n",
    "* [IPython](http://ipython.org/) (now known as [Jupyter](http://jupyter.org/)), and \n",
    "* [Cloud Datalab](https://cloud.google.com/datalab/), the recently announced interactive cloud-based platform on which this notebook is being developed. \n",
    "\n",
    "There are also many tutorials and samples available on GitHub (see, in particular, the [datalab](https://github.com/GoogleCloudPlatform/datalab) repo and the [Google Genomics](https://github.com/googlegenomics) project).\n",
    "\n",
    "In order to work with BigQuery, the first thing you need to do is import the [gcp.bigquery](http://googlecloudplatform.github.io/datalab/gcp.bigquery.html) package:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import gcp.bigquery as bq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next thing you need to know is how to access the specific tables you are interested in.  BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project.  The tables we are introducing in this notebook are in a dataset called **`tcga_201607_beta`**, owned by the **`isb-cgc`** project.  A full table identifier is of the form `<project_id>:<dataset_id>.<table_id>`.  Let's start by getting some basic information about the tables in this dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      6322 rows       1729204 bytes   Annotations\n",
      "     23797 rows       6382147 bytes   Biospecimen_data\n",
      "     11160 rows       4201379 bytes   Clinical_data\n",
      "   2646095 rows     333774244 bytes   Copy_Number_segments\n",
      "3944304319 rows  445303830985 bytes   DNA_Methylation_betas\n",
      " 382335670 rows   43164264006 bytes   DNA_Methylation_chr1\n",
      " 197519895 rows   22301345198 bytes   DNA_Methylation_chr10\n",
      " 235823572 rows   26623975945 bytes   DNA_Methylation_chr11\n",
      " 198050739 rows   22359642619 bytes   DNA_Methylation_chr12\n",
      "  97301675 rows   10986815862 bytes   DNA_Methylation_chr13\n",
      " 123239379 rows   13913712352 bytes   DNA_Methylation_chr14\n",
      " 124566185 rows   14064712239 bytes   DNA_Methylation_chr15\n",
      " 179772812 rows   20296128173 bytes   DNA_Methylation_chr16\n",
      " 234003341 rows   26417830751 bytes   DNA_Methylation_chr17\n",
      "  50216619 rows    5669139362 bytes   DNA_Methylation_chr18\n",
      " 211386795 rows   23862583107 bytes   DNA_Methylation_chr19\n",
      " 279668485 rows   31577200462 bytes   DNA_Methylation_chr2\n",
      "  86858120 rows    9805923353 bytes   DNA_Methylation_chr20\n",
      "  35410447 rows    3997986812 bytes   DNA_Methylation_chr21\n",
      "  70676468 rows    7978947938 bytes   DNA_Methylation_chr22\n",
      " 201119616 rows   22705358910 bytes   DNA_Methylation_chr3\n",
      " 159148744 rows   17968482285 bytes   DNA_Methylation_chr4\n",
      " 195864180 rows   22113162401 bytes   DNA_Methylation_chr5\n",
      " 290275524 rows   32772371379 bytes   DNA_Methylation_chr6\n",
      " 240010275 rows   27097948808 bytes   DNA_Methylation_chr7\n",
      " 164810092 rows   18607886221 bytes   DNA_Methylation_chr8\n",
      "  81260723 rows    9173717922 bytes   DNA_Methylation_chr9\n",
      "  98082681 rows   11072059468 bytes   DNA_Methylation_chrX\n",
      "   2330426 rows     263109775 bytes   DNA_Methylation_chrY\n",
      "   1867233 rows     207365611 bytes   Protein_RPPA_data\n",
      "   5356089 rows    5715538107 bytes   Somatic_Mutation_calls\n",
      "   5738048 rows     657855993 bytes   mRNA_BCGSC_GA_RPKM\n",
      "  38299138 rows    4459086535 bytes   mRNA_BCGSC_HiSeq_RPKM\n",
      "  44037186 rows    5116942528 bytes   mRNA_BCGSC_RPKM\n",
      "  16794358 rows    1934755686 bytes   mRNA_UNC_GA_RSEM\n",
      " 211284521 rows   24942992190 bytes   mRNA_UNC_HiSeq_RSEM\n",
      " 228078879 rows   26877747876 bytes   mRNA_UNC_RSEM\n",
      "  11997545 rows    2000881026 bytes   miRNA_BCGSC_GA_isoform\n",
      "   4503046 rows     527101917 bytes   miRNA_BCGSC_GA_mirna\n",
      "  90237323 rows   15289326462 bytes   miRNA_BCGSC_HiSeq_isoform\n",
      "  28207741 rows    3381212265 bytes   miRNA_BCGSC_HiSeq_mirna\n",
      " 102234868 rows   17290207488 bytes   miRNA_BCGSC_isoform\n",
      "  32710787 rows    3908314182 bytes   miRNA_BCGSC_mirna\n",
      "  26763022 rows    3265303352 bytes   miRNA_Expression\n"
     ]
    }
   ],
   "source": [
    "d = bq.DataSet('isb-cgc:tcga_201607_beta')\n",
    "for t in d.tables():\n",
    "  print '%10d rows  %12d bytes   %s' \\\n",
    "      % (t.metadata.rows, t.metadata.size, t.name.table_id)"
   ]
  },
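  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small aside, the `<project_id>:<dataset_id>.<table_id>` format described above can be taken apart with plain Python string handling.  The cell below is only a sketch: `split_table_id` is a hypothetical helper written for this illustration (it is not part of `gcp.bigquery`), and it assumes that the project id ends at the last colon and that the dataset id contains no dot.  A helper like this can be handy when you want to report or group results by project or dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Hypothetical helper (not part of gcp.bigquery): split a full table\n",
    "# identifier of the form <project_id>:<dataset_id>.<table_id>.\n",
    "def split_table_id(full_id):\n",
    "    # Split on the last colon so the project id is kept whole.\n",
    "    project_id, rest = full_id.rsplit(':', 1)\n",
    "    # The dataset id and the table id are separated by the first dot.\n",
    "    dataset_id, table_id = rest.split('.', 1)\n",
    "    return project_id, dataset_id, table_id\n",
    "\n",
    "parts = split_table_id('isb-cgc:tcga_201607_beta.Clinical_data')\n",
    "print('project=%s  dataset=%s  table=%s' % parts)"
   ]
  },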
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These tables are based on the open-access TCGA data as of July 2016.  The molecular data is all \"Level 3\" data, and is divided according to platform/pipeline.  See [here](https://tcga-data.nci.nih.gov/tcga/tcgaDataType.jsp) for additional details regarding the TCGA data levels and data types.\n",
    "\n",
    "Additional notebooks describe each of these tables in more detail, but here is an overview, in the same alphabetical order in which they are listed above and in the BigQuery web UI:\n",
    "\n",
    "\n",
    "- **Annotations**:  This table contains the annotations that are also available from the interactive [TCGA Annotations Manager](https://tcga-data.nci.nih.gov/annotations/).  Annotations can be associated with any type of \"item\" (*e.g.*, Patient, Sample, Aliquot, etc.), and a single item may have more than one annotation.  Common annotations include \"Item flagged DNU\", \"Item is noncanonical\", and \"Prior malignancy.\"  More information about this table can be found in the [TCGA Annotations](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/TCGA%20Annotations.ipynb) notebook.\n",
    "\n",
    "\n",
    "- **Biospecimen_data**:  This table contains information obtained from the \"biospecimen\" and \"auxiliary\" XML files in the TCGA Level-1 \"bio\" archives.  Each row in this table represents a single \"biospecimen\" or \"sample\".  Most participants in the TCGA project provided two samples: a \"primary tumor\" sample and a \"blood normal\" sample, but others provided normal-tissue, metastatic, or other types of samples.  This table contains metadata about all of the samples.  More information about exploring this table, and about using it to create your own custom analysis cohort, can be found in the [Creating TCGA cohorts (part 1)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb) and [(part 2)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%202.ipynb) notebooks.\n",
    "\n",
    "\n",
    "- **Clinical_data**:  This table contains information obtained from the \"clinical\" XML files in the TCGA Level-1 \"bio\" archives.  Not all fields in the XML files are represented in this table, but any field which was found to be significantly filled-in for at least one tumor-type has been retained.  More information about exploring this table and using this information to create your own custom analysis cohort can be found in the [Creating TCGA cohorts (part 1)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb) and [(part 2)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%202.ipynb) notebooks.\n",
    "\n",
    "\n",
    "- **Copy_Number_segments**:  This table contains Level-3 copy-number segmentation results generated by The Broad Institute from Genome Wide SNP 6 data, using the CBS (Circular Binary Segmentation) algorithm.  The values are log2(copynumber/2), centered on 0.  More information about this data table can be found in the [Copy Number segments](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Copy%20Number%20segments.ipynb) notebook.\n",
    "\n",
    "\n",
    "- **DNA_Methylation_betas**:  This table contains Level-3 summary measures of DNA methylation for each interrogated locus (beta values: M/(M+U)).  This table contains data from two different platforms: the Illumina Infinium HumanMethylation 27k and 450k arrays.  More information about this data table can be found in the [DNA Methylation](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/DNA%20Methylation.ipynb) notebook.  Note that individual chromosome-specific DNA Methylation tables are also available to cut down on the amount of data that you may need to query (depending on your use case).  \n",
    "\n",
    "\n",
    "- **Protein_RPPA_data**:  This table contains the normalized Level-3 protein expression levels based on each antibody used to probe the sample.  More information about how this data was generated by the RPPA Core Facility at MD Anderson can be found [here](https://wiki.nci.nih.gov/display/TCGA/Protein+Array+Data+Format+Specification#ProteinArrayDataFormatSpecification-Expression-Protein), and more information about this data table can be found in the [Protein expression](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Protein%20expression.ipynb) notebook.\n",
    "\n",
    "\n",
    "- **Somatic_Mutation_calls**: This table contains annotated somatic mutation calls.  All current MAF (Mutation Annotation Format) files were annotated using [Oncotator](http://onlinelibrary.wiley.com/doi/10.1002/humu.22771/abstract;jsessionid=15E7960BA5FEC21EE608E6D262390C52.f01t04) v1.5.1.0, and merged into a single table.  More information about this data table can be found in the [Somatic Mutations](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Somatic%20Mutations.ipynb) notebook, including an example of how to use the [Tute Genomics annotations database in BigQuery](http://googlegenomics.readthedocs.org/en/latest/use_cases/annotate_variants/tute_annotation.html).\n",
    "\n",
    "\n",
    "- **mRNA_BCGSC_HiSeq_RPKM**: This table contains mRNAseq-based gene expression data produced by the [BC Cancer Agency](http://www.bcgsc.ca/).  (For details about a very similar table, take a look at a [notebook](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/UNC%20HiSeq%20mRNAseq%20gene%20expression.ipynb) describing the other mRNAseq gene expression table.)\n",
    "\n",
    "\n",
    "- **mRNA_UNC_HiSeq_RSEM**: This table contains mRNAseq-based gene expression data produced by [UNC Lineberger](https://unclineberger.org/).  More information about this data table can be found in the [UNC HiSeq mRNAseq gene expression](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/UNC%20HiSeq%20mRNAseq%20gene%20expression.ipynb) notebook.\n",
    "\n",
    "\n",
    "- **miRNA_Expression**: This table contains miRNAseq-based expression data for mature microRNAs produced by the [BC Cancer Agency](http://www.bcgsc.ca/).  More information about this data table can be found in the [microRNA expression](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/BCGSC%20microRNA%20expression.ipynb) notebook."
   ]
  },
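  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two formulas quoted in the overview above, the methylation beta value M/(M+U) and the copy-number segment value log2(copynumber/2), are easy to check by hand.  The cell below is a minimal sketch in plain Python; the function names and the input numbers are made up for illustration and are not taken from the tables themselves.  It shows, for example, that a segment with the normal two copies maps to 0, which is what \"centered on 0\" means."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Beta value as described above: M / (M + U), where M and U are the\n",
    "# methylated and unmethylated signal intensities (illustrative inputs).\n",
    "def beta_value(methylated, unmethylated):\n",
    "    return float(methylated) / (methylated + unmethylated)\n",
    "\n",
    "# Copy-number segment value as described above: log2(copynumber / 2),\n",
    "# so a diploid segment (2 copies) maps to 0.\n",
    "def segment_value(copy_number):\n",
    "    return math.log(copy_number / 2.0, 2)\n",
    "\n",
    "# Inverse of segment_value: recover the copy number from the log ratio.\n",
    "def copy_number_from_segment(value):\n",
    "    return 2.0 * 2.0 ** value\n",
    "\n",
    "print('beta     = %.2f' % beta_value(300, 100))\n",
    "print('2 copies -> %.2f' % segment_value(2))\n",
    "print('4 copies -> %.2f' % segment_value(4))\n",
    "print('1 copy   -> %.2f' % segment_value(1))"
   ]
  },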
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Where to start?\n",
    "We suggest that you start with the two \"Creating TCGA cohorts\" notebooks ([part 1](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb) and [part 2](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%202.ipynb)), which describe and make use of the Clinical and Biospecimen tables.  From there you can delve into the various molecular data tables as well as the Annotations table.  For now, these sample notebooks are intentionally simple and do not perform any analysis that integrates data from multiple tables, but once you have a grasp of how to use the data, developing your own more complex analyses should not be difficult.  You could even contribute an example back to our GitHub repository!  You are also welcome to submit bug reports, comments, and feature requests as [GitHub issues](https://github.com/isb-cgc/examples-Python/issues)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A note about BigQuery tables and \"tidy data\"\n",
    "You may be used to thinking about a molecular data table such as a gene-expression table as a matrix where the rows are genes and the columns are samples (or *vice versa*).  These BigQuery tables instead use the [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) approach, with each \"cell\" from the traditional data-matrix becoming a single row in the BigQuery table.  A 10,000 gene x 500 sample matrix would therefore become a 5,000,000 row BigQuery table."
   ]
  }
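  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the matrix-to-rows reshaping concrete, here is a minimal sketch in plain Python.  The gene names, sample labels, and expression values are invented for illustration and are not taken from the TCGA tables.  Each cell of the small matrix becomes one `(gene, sample, value)` row, which is the same layout the BigQuery tables use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Toy expression matrix: rows are genes, columns are samples.\n",
    "# All names and values here are made up for illustration.\n",
    "genes = ['GENE_A', 'GENE_B']\n",
    "samples = ['sample_1', 'sample_2', 'sample_3']\n",
    "matrix = [[1.5, 2.0, 0.5],\n",
    "          [3.0, 0.0, 4.5]]\n",
    "\n",
    "# Tidy form: one (gene, sample, value) row per matrix cell.\n",
    "tidy_rows = [(gene, sample, matrix[i][j])\n",
    "             for i, gene in enumerate(genes)\n",
    "             for j, sample in enumerate(samples)]\n",
    "\n",
    "for row in tidy_rows:\n",
    "    print('%-8s %-10s %4.1f' % row)\n",
    "\n",
    "# A 2 x 3 matrix yields 2 * 3 = 6 rows, just as a 10,000 x 500 matrix\n",
    "# would yield 5,000,000 rows.\n",
    "print('%d rows' % len(tidy_rows))"
   ]
  }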
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}