{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The ISB-CGC open-access TCGA tables in Big-Query\n", "\n", "The goal of this notebook is to introduce you to a new publicly-available, open-access dataset in BigQuery. This set of BigQuery tables was produced by the [ISB-CGC](http://www.isb-cgc.org) project, based on the open-access [TCGA](http://cancergenome.nih.gov/) data available at the TCGA [Data Portal](https://tcga-data.nci.nih.gov/tcga/). You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery. If you don't already have one, you can sign up for a [free-trial](https://cloud.google.com/free-trial/) or contact [us](mailto://info@isb-cgc.org) and become part of the community evaluation phase of our Cancer Genomics Cloud pilot. (You can find more information about this NCI-funded program [here](https://cbiit.nci.nih.gov/ncip/nci-cancer-genomics-cloud-pilots).)\n", "\n", "We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists. Here are links to some resources that you might find useful: \n", "* [BigQuery](https://cloud.google.com/bigquery/what-is-bigquery), \n", "* the BigQuery [web UI](https://bigquery.cloud.google.com/) where you can run queries interactively, \n", "* [IPython](http://ipython.org/) (now known as [Jupyter](http://jupyter.org/)), and \n", "* [Cloud Datalab](https://cloud.google.com/datalab/) the recently announced interactive cloud-based platform that this notebook is being developed on. \n", "\n", "There are also many tutorials and samples available on github (see, in particular, the [datalab](https://github.com/GoogleCloudPlatform/datalab) repo and the [Google Genomics]( https://github.com/googlegenomics) project).\n", "\n", "In order to work with BigQuery, the first thing you need to do is import the [gcp.bigquery](http://googlecloudplatform.github.io/datalab/gcp.bigquery.html) package:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import gcp.bigquery as bq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next thing you need to know is how to access the specific tables you are interested in. BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project. The tables we are introducing in this notebook are in a dataset called **`tcga_201607_beta`**, owned by the **`isb-cgc`** project. A full table identifier is of the form `:.`. Let's start by getting some basic information about the tables in this dataset:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 6322 rows 1729204 bytes Annotations\n", " 23797 rows 6382147 bytes Biospecimen_data\n", " 11160 rows 4201379 bytes Clinical_data\n", " 2646095 rows 333774244 bytes Copy_Number_segments\n", "3944304319 rows 445303830985 bytes DNA_Methylation_betas\n", " 382335670 rows 43164264006 bytes DNA_Methylation_chr1\n", " 197519895 rows 22301345198 bytes DNA_Methylation_chr10\n", " 235823572 rows 26623975945 bytes DNA_Methylation_chr11\n", " 198050739 rows 22359642619 bytes DNA_Methylation_chr12\n", " 97301675 rows 10986815862 bytes DNA_Methylation_chr13\n", " 123239379 rows 13913712352 bytes DNA_Methylation_chr14\n", " 124566185 rows 14064712239 bytes DNA_Methylation_chr15\n", " 179772812 rows 20296128173 bytes DNA_Methylation_chr16\n", " 234003341 rows 26417830751 bytes DNA_Methylation_chr17\n", " 50216619 rows 5669139362 bytes DNA_Methylation_chr18\n", " 211386795 rows 23862583107 bytes DNA_Methylation_chr19\n", " 279668485 rows 31577200462 bytes DNA_Methylation_chr2\n", " 86858120 rows 9805923353 bytes DNA_Methylation_chr20\n", " 35410447 rows 3997986812 bytes DNA_Methylation_chr21\n", " 70676468 rows 7978947938 bytes DNA_Methylation_chr22\n", " 201119616 rows 22705358910 bytes DNA_Methylation_chr3\n", " 159148744 rows 17968482285 bytes DNA_Methylation_chr4\n", " 195864180 rows 22113162401 bytes DNA_Methylation_chr5\n", " 290275524 rows 32772371379 bytes DNA_Methylation_chr6\n", " 240010275 rows 27097948808 bytes DNA_Methylation_chr7\n", " 164810092 rows 18607886221 bytes DNA_Methylation_chr8\n", " 81260723 rows 9173717922 bytes DNA_Methylation_chr9\n", " 98082681 rows 11072059468 bytes DNA_Methylation_chrX\n", " 2330426 rows 263109775 bytes DNA_Methylation_chrY\n", " 1867233 rows 207365611 bytes Protein_RPPA_data\n", " 5356089 rows 5715538107 bytes Somatic_Mutation_calls\n", " 5738048 rows 657855993 bytes mRNA_BCGSC_GA_RPKM\n", " 38299138 rows 4459086535 bytes mRNA_BCGSC_HiSeq_RPKM\n", " 44037186 rows 5116942528 bytes mRNA_BCGSC_RPKM\n", " 16794358 rows 1934755686 bytes mRNA_UNC_GA_RSEM\n", " 211284521 rows 24942992190 bytes mRNA_UNC_HiSeq_RSEM\n", " 228078879 rows 26877747876 bytes mRNA_UNC_RSEM\n", " 11997545 rows 2000881026 bytes miRNA_BCGSC_GA_isoform\n", " 4503046 rows 527101917 bytes miRNA_BCGSC_GA_mirna\n", " 90237323 rows 15289326462 bytes miRNA_BCGSC_HiSeq_isoform\n", " 28207741 rows 3381212265 bytes miRNA_BCGSC_HiSeq_mirna\n", " 102234868 rows 17290207488 bytes miRNA_BCGSC_isoform\n", " 32710787 rows 3908314182 bytes miRNA_BCGSC_mirna\n", " 26763022 rows 3265303352 bytes miRNA_Expression\n" ] } ], "source": [ "d = bq.DataSet('isb-cgc:tcga_201607_beta')\n", "for t in d.tables():\n", " print '%10d rows %12d bytes %s' \\\n", " % (t.metadata.rows, t.metadata.size, t.name.table_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These tables are based on the open-access TCGA data as of July 2016. The molecular data is all \"Level 3\" data, and is divided according to platform/pipeline. See [here](https://tcga-data.nci.nih.gov/tcga/tcgaDataType.jsp) for additional details regarding the TCGA data levels and data types.\n", "\n", "Additional notebooks go into each of these tables in more detail, but here is an overview, in the same alphabetical order that they are listed in above and in the BigQuery web UI:\n", "\n", "\n", "- **Annotations**: This table contains the annotations that are also available from the interactive [TCGA Annotations Manager](https://tcga-data.nci.nih.gov/annotations/). Annotations can be associated with any type of \"item\" (*eg* Patient, Sample, Aliquot, etc), and a single item may have more than one annotation. Common annotations include \"Item flagged DNU\", \"Item is noncanonical\", and \"Prior malignancy.\" More information about this table can be found in the [TCGA Annotations](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/TCGA%20Annotations.ipynb) notebook.\n", "\n", "\n", "- **Biospecimen_data**: This table contains information obtained from the \"biospecimen\" and \"auxiliary\" XML files in the TCGA Level-1 \"bio\" archives. Each row in this table represents a single \"biospecimen\" or \"sample\". Most participants in the TCGA project provided two samples: a \"primary tumor\" sample and a \"blood normal\" sample, but others provided normal-tissue, metastatic, or other types of samples. This table contains metadata about all of the samples, and more information about exploring this table and using this information to create your own custom analysis cohort can be found in the [Creating TCGA cohorts (part 1)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb) and [(part 2)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%202.ipynb) notebooks.\n", "\n", "\n", "- **Clinical_data**: This table contains information obtained from the \"clinical\" XML files in the TCGA Level-1 \"bio\" archives. Not all fields in the XML files are represented in this table, but any field which was found to be significantly filled-in for at least one tumor-type has been retained. More information about exploring this table and using this information to create your own custom analysis cohort can be found in the [Creating TCGA cohorts (part 1)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb) and [(part 2)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%202.ipynb) notebooks.\n", "\n", "\n", "- **Copy_Number_segments**: This table contains Level-3 copy-number segmentation results generated by The Broad Institute, from Genome Wide SNP 6 data using the CBS (Circular Binary Segmentation) algorithm. The values are base2 log(copynumber/2), centered on 0. More information about this data table can be found in the [Copy Number segments](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Copy%20Number%20segments.ipynb) notebook.\n", "\n", "\n", "- **DNA_Methylation_betas**: This table contains Level-3 summary measures of DNA methylation for each interrogated locus (beta values: M/(M+U)). This table contains data from two different platforms: the Illumina Infinium HumanMethylation 27k and 450k arrays. More information about this data table can be found in the [DNA Methylation](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/DNA%20Methylation.ipynb) notebook. Note that individual chromosome-specific DNA Methylation tables are also available to cut down on the amount of data that you may need to query (depending on yoru use case). \n", "\n", "\n", "- **Protein_RPPA_data**: This table contains the normalized Level-3 protein expression levels based on each antibody used to probe the sample. More information about how this data was generated by the RPPA Core Facility at MD Anderson can be found [here](https://wiki.nci.nih.gov/display/TCGA/Protein+Array+Data+Format+Specification#ProteinArrayDataFormatSpecification-Expression-Protein), and more information about this data table can be found in the [Protein expression](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Protein%20expression.ipynb) notebook.\n", "\n", "\n", "- **Somatic_Mutation_calls**: This table contains annotated somatic mutation calls. All current MAF (Mutation Annotation Format) files were annotated using [Oncotator](http://onlinelibrary.wiley.com/doi/10.1002/humu.22771/abstract;jsessionid=15E7960BA5FEC21EE608E6D262390C52.f01t04) v1.5.1.0, and merged into a single table. More information about this data table can be found in the [Somatic Mutations](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Somatic%20Mutations.ipynb) notebook, including an example of how to use the [Tute Genomics annotations database in BigQuery](http://googlegenomics.readthedocs.org/en/latest/use_cases/annotate_variants/tute_annotation.html).\n", "\n", "\n", "- **mRNA_BCGSC_HiSeq_RPKM**: This table contains mRNAseq-based gene expression data produced by the [BC Cancer Agency](http://www.bcgsc.ca/). (For details about a very similar table, take a look at a [notebook](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/UNC%20HiSeq%20mRNAseq%20gene%20expression.ipynb) describing the other mRNAseq gene expression table.)\n", "\n", "\n", "- **mRNA_UNC_HiSeq_RSEM**: This table contains mRNAseq-based gene expression data produced by [UNC Lineberger](https://unclineberger.org/). More information about this data table can be found in the [UNC HiSeq mRNAseq gene expression](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/UNC%20HiSeq%20mRNAseq%20gene%20expression.ipynb) notebook.\n", "\n", "\n", "- **miRNA_expression**: This table contains miRNAseq-based expression data for mature microRNAs produced by the [BC Cancer Agency](http://www.bcgsc.ca/). More information about this data table can be found in the [microRNA expression](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/BCGSC%20microRNA%20expression.ipynb) notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Where to start?\n", "We suggest that you start with the two \"Creating TCGA cohorts\" notebooks ([part 1](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb) and [part 2](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%202.ipynb)) which describe and make use of the Clinical and Biospecimen tables. From there you can delve into the various molecular data tables as well as the Annotations table. For now these sample notebooks are intentionally relatively simple and do not do any analysis that integrates data from multiple tables but once you have a grasp of how to use the data, developing your own more complex analyses should not be difficult. You could even contribute an example back to our github repository! You are also welcome to submit bug reports, comments, and feature-requests as [github issues](https://github.com/isb-cgc/examples-Python/issues)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A note about BigQuery tables and \"tidy data\"\n", "You may be used to thinking about a molecular data table such as a gene-expression table as a matrix where the rows are genes and the columns are samples (or *vice versa*). These BigQuery tables instead use the [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) approach, with each \"cell\" from the traditional data-matrix becoming a single row in the BigQuery table. A 10,000 gene x 500 sample matrix would therefore become a 5,000,000 row BigQuery table." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }