{ "metadata": { "name": "", "signature": "sha256:c4ef5942519b0625d297c7d65f712b646926b88f0b6635d9bf3a00841197e06b" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Software Overview \n", "\n", "This repository contains instructions for reproduction and extension of [Multi-tiered genomic analysis of head and neck cancer ties TP53 mutation to 3p loss]() by Gross et al. In general, code for data processing and computation is enclosed in standard Python modules, while high-level analysis is recorded in IPython Notebooks. The analysis for this project was relatively non-linear and has thus been split into a number of notebooks as described in [Analysis Notebooks](./Analysis_Notebooks#guide-to-running), but the results should be reproducible by running these notebooks. \n", "\n", "__As of July 1, 2014 all error bars are off due to a [Pandas bug](https://github.com/pydata/pandas/issues/7643). They now show the difference between the mean and the lower bound as the uncertainty for both the upper and lower bounds rather than the true 95% confidence interval. Hopefully this will be addressed soon.__\n", "\n", "##Dependencies \n", "\n", "This code uses a number of features in the scientific Python stack as well as a small set of standard R libraries. Thus far, this code has only been tested in a Linux environment; it may take some modification to run on other operating systems.\n", "\n", "I highly recommend installing a scientific Python distribution such as [Anaconda](http://continuum.io/) or [Enthought](https://www.enthought.com/) to handle the majority of the Python dependencies in this project (other than rPy2 and matplotlib_venn). 
These are both free for academic use.\n", "\n", "###Python Dependencies \n", "* [Numpy and Scipy](http://www.scipy.org/), numeric calculations and statistics in Python \n", "* [matplotlib](http://matplotlib.org/), plotting in Python\n", "* [Pandas](http://pandas.pydata.org/), data-frames for Python, handles the majority of data-structures \n", "* [statsmodels](http://statsmodels.sourceforge.net/), used for statistics \n", "* [scikit-learn](http://scikit-learn.org/stable/), used for supervised learning\n", "* [rPy2](http://rpy.sourceforge.net/rpy2.html), communication between R and Python \n", " * __NOT IN DISTRIBUTIONS__ \n", " * I recommend installing with `pip install rpy2` \n", " * Needs R to be compiled with shared libraries \n", "* [matplotlib_venn](https://pypi.python.org/pypi/matplotlib-venn) \n", " * __NOT IN DISTRIBUTIONS__ \n", " * I recommend installing with `pip install matplotlib_venn` \n", " * Only used for Venn diagrams, not essential\n", " \n", " \n", "###R Dependencies\n", "* Needs to be compiled with shared libraries to communicate with Python (_this can be tricky_)\n", "* Packages\n", " * base\n", " * survival\n", " * MASS\n", " \n", "###Command Line Dependencies \n", "* curl (http://curl.haxx.se/) for fetching URLs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
As of July 1, 2014 all error bars are off due to a [Pandas bug](https://github.com/pydata/pandas/issues/7643). They now show the difference between the mean and the lower bound as the uncertainty for both the upper and lower bounds rather than the true 95% confidence interval. Hopefully this will be addressed soon.
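
As a minimal sketch of the distinction (not the plotting code used in these notebooks), the snippet below converts a mean and its lower and upper 95% confidence bounds into the two-row layout that matplotlib's `errorbar` accepts through `yerr` for asymmetric bars. The helper name `asymmetric_yerr` and the numbers are hypothetical and chosen only for illustration.

```python
def asymmetric_yerr(means, lowers, uppers):
    '''Return [below, above] distances from each mean to its bounds.

    This is the 2 x N layout that matplotlib's errorbar(yerr=...)
    accepts: first row is the distance below, second row above.
    '''
    below = [m - lo for m, lo in zip(means, lowers)]
    above = [hi - m for m, hi in zip(means, uppers)]
    return [below, above]


# Hypothetical values, for illustration only (not results from the paper).
means = [2.0, 5.0]
lowers = [1.5, 3.0]   # lower 95% confidence bounds
uppers = [3.0, 6.0]   # upper 95% confidence bounds

# True (asymmetric) interval distances.
yerr = asymmetric_yerr(means, lowers, uppers)

# What the affected plots effectively draw: the lower distance on both sides.
symmetric = [yerr[0], yerr[0]]
```

Passing `yerr` to `errorbar` directly would draw the true interval, while `symmetric` reproduces the behavior described in the note above.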
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Guide to Running \n", "\n", "##Initialization\n", "* [__download_data__](./download_data.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/download_data.ipynb))\n", " Pulls all of the necessary data from the net and constructs the file tree and data objects used in the rest of the analysis. \n", " \n", " \n", "* [__get_all_MAFs__](./get_all_MAFs.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/get_all_MAFs.ipynb))\n", " Script to download and process updated MAF files from the TCGA Data Portal. \n", " \n", " \n", "* [__get_updated_clinical__](./get_updated_clinical.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/get_updated_clinical.ipynb)) \n", " Script to download and process updated clinical data from the TCGA Data Portal.\n", " \n", " \n", "##Primary Analysis \n", "(There are dependencies among these, run them in order.)\n", "* [__HPV_Process_Data__](./HPV_Process_Data.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/HPV_Process_Data.ipynb)) \n", " Compile HPV status for all patient tumors. \n", " Calculate global variables and meta features in the HPV- background. \n", " \n", "* [__binarize_clinical__](./binarize_clinical.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/binarize_clinical.ipynb)) \n", " Process clinical variables into binary matrix for use in prognostic screens. \n", "\n", "* [__Prognostic_Screen__](./Prognostic_Screen.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/Prognostic_Screen.ipynb)) \n", " Run the primary prognostic screen for HPV- HNSCC patients. 
\n", " \n", " \n", "* [__Secondary_Screen__](./Secondary_Screen.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/Secondary_Screen.ipynb)) \n", " Run the prognostic screen for HPV- HNSCC patients with the TP53-3p event.\n", " \n", " \n", "* [__HNSCC_figures__](./HNSCC_figures.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/HNSCC_figures.ipynb)) \n", " Generate some of the figure panels for the HNSCC discovery cohort. Some of the other figures and figure panels are generated inline with the analysis. \n", " \n", " \n", "##Validation Cohorts\n", "\n", "* [__UPMC_cohort__](./UPMC_cohort.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/UPMC_cohort.ipynb)) \n", " Validation of primary findings in an independent patient cohort from the University of Pittsburgh ([Stransky et al.](http://www.sciencemag.org/content/333/6046/1157.full)).\n", " \n", "\n", "* [__Molecular_Validation__](./Molecular_Validation.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/Molecular_Validation.ipynb)) \n", " Validation of molecular associations in recent TCGA samples.\n", "\n", "\n", "* [__PANCAN_cohort__](./PANCAN_cohort.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/PANCAN_cohort.ipynb)) \n", " Validation of primary findings across ~4400 TCGA patient tumors. 
\n", " \n", "\n", "\n", "\n", " \n", "##Targeted Analysis for Support of Main Findings \n", "\n", "* [__Reviewer_Response__](./Reviewer_Response.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/Reviewer_Response.ipynb)) \n", " Specific responses to reviewer comments.\n", " \n", "\n", "* [__HNSCC_clinical_characterization__](HNSCC_clinical_characterization.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/HNSCC_clinical_characterization.ipynb)) \n", " Overview of clinical variables in the TCGA HNSCC cohort and their implications for patient prognosis.\n", " \n", "\n", "* [__TP53_exploration__](./TP53_exploration.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/TP53_exploration.ipynb)) \n", " Detailed characterization of TP53 mutations and their predicted functional impact. \n", " \n", " \n", "* [__HPV_characterization__](HPV_characterization.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/HPV_characterization.ipynb)) \n", " Detailed characterization of the clinical and molecular correlates of HPV+ status. \n", " \n", "\n", "* [__copy_number_exploration__](./copy_number_exploration.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/copy_number_exploration.ipynb)) \n", " Exploration of chromosomal instability, 3p deletion, TP53 mutation and the relationships between these factors. \n", " \n", "\n", "* [__Clinical_Covariates__](./Clinical_Covariates.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/Clinical_Covariates.ipynb)) \n", " Exploration of primary subtypes within the context of a number of clinical variables. 
\n", " \n", " \n", "* [__Multivariate_Modeling__](./Multivariate_Modeling.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/Multivariate_Modeling.ipynb)) \n", " Exploration of primary subtypes within the context of a few different multivariate models including clinical variables.\n", " \n", " \n", " \n", "##Variant Calling (optional)\n", "\n", "This requires a number of additional dependencies for sequencing analysis as well as function calls to proprietary software installed on our virtual machine hosted by Annai Systems. We have included all of the dependencies of this mutation calling step in the supplement as MAF files and highly recommend starting with these as opposed to re-calling mutations. \n", "\n", " \n", "* [__muTect_streamline__](muTect_streamline.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/muTect_streamline.ipynb)) \n", " This script is used to generate bash scripts to download and process additional TCGA data from CGHub. \n", " \n", " \n", "* [__new_data_process_TP53_Pancancer__](new_data_process_TP53_Pancancer.ipynb) ([nbviewer](http://nbviewer.ipython.org/github/theandygross/TCGA/blob/master/Analysis_Notebooks/new_data_process_TP53_Pancancer.ipynb)) \n", " Here we process the SNV and indel calls made by the variant calling tools, annotate them and consolidate them into a MAF file.\n", " " ] } ], "metadata": {} } ] }