{ "metadata": { "name": "", "signature": "sha256:24e50e2e41a3f52afde732750c09f1b4a2bf42ec86414405b0b59a13a389cd83" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Initialization Download and process all of the data fom Firehose

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####Notebook Summary\n", "\n", "Here we are downloading and processing most of the necissary data to run this analysis pipeline. I have set up a series of scripts to do this in an automated fashon in order to allow for reproduction of this study by others as well as for updating the results obtained here as more TCGA data is collected and reseased. \n", "\n", "Downloading this data can take a considerable amount of time (~5 hours) and disk space (~45GB), be prepared. \n", "\n", "We use the firehose_get script provided by the Broad to download the data, please see the [firehose_get documentation](ttps://confluence.broadinstitute.org/display/GDAC/Download) for troubleshooting. As we are making using the Broad's initial processing pipeline and data formats we can not promise this initial code will not break upon future update that they make. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "cd ../src/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/cellar/users/agross/TCGA_Code/TCGA/src\n" ] } ], "prompt_number": 2 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Parameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Change __OUT_PATH__ to directory on your machine where you want to store the data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "OUT_PATH = '../Data'\n", "RUN_DATE = '2014_01_15'\n", "VERSION = 'all'\n", "CANCER = 'HNSC'\n", "FIGDIR = '../Figures'\n", "\n", "DESCRIPTION = '''Updating analysis for updated dataset.'''\n", "\n", "PARAMETERS = {'min_patients' : 12,\n", " 'pathway_file' : '../Extra_Data/c2.cp.v3.0.symbols_edit.csv'\n", " }" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Initialization" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pickle as pickle\n", "import pandas as pd\n", "import os as os \n", "\n", "from Data.Containers import Run\n", "\n", "from Data.Containers import get_run\n", "from Data.Containers import Cancer\n", "from Initialization.InitializeCN import initialize_cn\n", "from Initialization.InitializeReal import initialize_real\n", "from Initialization.InitializeMut import initialize_mut\n", "from Initialization.PreprocessMethylation import process_meth" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython import utils \n", "from IPython.display import HTML" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "css_file = 'profile_default/static/custom/custom.css'\n", "base = utils.path.get_ipython_dir()\n", "styles = \"\" % (open(os.path.join(base, css_file),'r').read())\n", "display(HTML(styles))" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "!curl http://gdac.broadinstitute.org/runs/code/firehose_get_latest.zip -o fh_get.zip\n", "!unzip fh_get.zip" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\r\n", " Dload Upload Total Spent Left Speed\r\n", "\r", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", " 39 6542 39 2559 0 0 3641 0 0:00:01 --:--:-- 0:00:01 4783" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "100 6542 100 6542 0 0 8343 0 --:--:-- --:--:-- --:--:-- 10620\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Archive: fh_get.zip\r\n", "replace firehose_get? [y]es, [n]o, [A]ll, [N]one, [r]ename: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "^C\r\n" ] } ], "prompt_number": 7 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Download data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "d = 'http://gdac.broadinstitute.org/runs/analyses__{}/ingested_data.tsv'.format(RUN_DATE)\n", "tab = pd.read_table(d, sep='\\n', header=None)\n", "skip = tab[0].dropna().apply(lambda s: s.startswith('#'))\n", "skip = list(skip[skip==True].index)\n", "tab = pd.read_table(d, skiprows=skip, index_col=0).dropna()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "cancers = tab[tab.Clinical>0].index[:-1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Takes about 5 min to download data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "cancer_string = ' '.join(cancers)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!firehose_get -b analyses $RUN_DATE $cancer_string > tmp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!firehose_get -b -o miR_gene_expression stddata $RUN_DATE $cancer_string > tmp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!firehose_get -b -o RSEM_genes_normalized stddata $RUN_DATE $cancer_string > tmp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!firehose_get -b -o rppa stddata $RUN_DATE $cancer_string > tmp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!firehose_get -b -o clinical stddata $RUN_DATE $cancer_string > tmp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Takes about 25 min to download methylation data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!firehose_get -b -o humanmethylation450 stddata $RUN_DATE HNSC" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Clean Up Downloads" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No going back from here, so I would check your data to make sure everything got downloaded correctly." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!rm fh_get.zip\n", "!rm firehose_get" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "if not os.path.isdir(OUT_PATH):\n", " os.makedirs(OUT_PATH)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "analyses_folder = 'analyses__' + RUN_DATE\n", "!mv $analyses_folder {OUT_PATH + '/' + analyses_folder}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "stddata_folder = 'stddata__' + RUN_DATE\n", "!mv $stddata_folder {OUT_PATH + '/' + stddata_folder}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Exctract data and set up file hierarchy for downstream analysis" ] }, { "cell_type": "code", "collapsed": true, "input": [ "from Initialization.ProcessFirehose import process_all_cancers\n", "\n", "process_all_cancers(OUT_PATH, RUN_DATE)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get rid of all of the downloaded zip files that we processed" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!rm -rf {OUT_PATH + '/' + stddata_folder}\n", "!rm -rf {OUT_PATH + '/' + analyses_folder}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ls" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Create Run Object for Firehose Download" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data_path = '{}/Firehose__{}/'.format(OUT_PATH, RUN_DATE)\n", "result_path = data_path + 'ucsd_analyses/'" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "cancer_codes = pd.read_table('../Extra_Data/diseaseStudy.txt',\n", " index_col=0, squeeze=True)\n", "run_dir = 'http://gdac.broadinstitute.org/runs'\n", "f = '{}/analyses__{}/ingested_data.tsv'.format(run_dir, RUN_DATE)\n", "sample_matrix = pd.read_table(f, index_col=0).dropna()\n", "sample_matrix = sample_matrix.ix[[c for c in sample_matrix.index if \n", " c not in ['PANCAN12', 'COADREAD','Totals']]]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "run = Run(RUN_DATE, VERSION, data_path, result_path, PARAMETERS, \n", " cancer_codes, sample_matrix, DESCRIPTION)\n", "run.save()\n", "run" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Initialize Data Objects into the File Hierarchy" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def init(c, run):\n", " try:\n", " cancer_obj = Cancer(c, run) \n", " cancer_obj.initialize_data(run, save=True)\n", " except:\n", " print c + '\\t' + 'all'\n", " try:\n", " initialize_real(c, run.report_path, 'mRNASeq', \n", " create_meta_features=True)\n", " except:\n", " print c + '\\t' + 'mRNASeq'\n", " try:\n", " initialize_real(c, run.report_path, 'RPPA', \n", " create_meta_features=True, create_real_features=False)\n", " except:\n", " print c + '\\t' + 'RPPA'\n", " try:\n", " initialize_real(c, run.report_path, 'miRNASeq', \n", " create_meta_features=False)\n", " except:\n", " print c + '\\t' + 'miRNASeq'\n", " try:\n", " initialize_cn(c, run.report_path, 'CN_broad')\n", " except:\n", " print c + '\\t' + 'CN' \n", " try:\n", " initialize_mut(c, run.report_path, create_meta_features=True);\n", " except:\n", " print c + '\\t' + 'mut' " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": true, "input": [ "for cancer in run.cancers:\n", " init(cancer, run)" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }