{
 "metadata": {
  "name": "",
  "signature": "sha256:4a98a2d007628f3bf9afe460c58594452d2326c4e2ffdf46e6e557748febb756"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Run Variant Calls on TCGA BAM Files"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here I am generating a series of bash scripts to run MuTect and SomaticIndelDetector on TCGA samples in CGHub.  I'm including it for completeness, but this part is probably not one size fits all due to a number of hardware dependencies, ect. \n",
      "\n",
      "We are running the variant calling on a virtual machine hosted by [Annai Systems](http://www.annaisystems.com/) which is co-located with the data at the San Diego Supercomputing Center.  Most of the pipeline should be standard, but we do use Annai's proprietary GTFuse protocal for moving the data, which allows us to move BAM files in a couple of seconds as compared to hours. To this effect, if you are not using this system, you will need to modify the build_call function to move and clean up the data.  \n",
      "\n",
      "There are a lot of globals here.  I should probably move this into a class at some point. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%pylab inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Populating the interactive namespace from numpy and matplotlib\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "import os as os"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Set Global Variables"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "File requirements for variant calling \n",
      "\n",
      "* MuTect 1.1.5 jar\n",
      "* GATK with SomaticIndelDetector (version 2.2-2)\n",
      "* dbnp vcf file\n",
      "* cosmic vcf file\n",
      "* reference genome \n",
      "* The variant files and the reference genome should all be for the same reference \n",
      "* This can all be downloaded as part of the nice [bcbio-nextgen](https://github.com/chapmanb/bcbio-nextgen) package."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "CGHub manifest.  I'm linking to an old one to keep the analysis set constant, but feel free to update.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cghub_manifest = 'https://cghub.ucsc.edu/reports/SUMMARY_STATS/2014-02-24T12:00:01-0800_data_manifest.tsv'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "CGHUB download url"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "CGHUB = 'https://cghub.ucsc.edu/cghub/data/analysis/download'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here I'm using GRCh37 for all of these files."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "DBSNP = '/home/moores/projects/pipeline/files/variation/dbsnp_138.vcf'\n",
      "COSMIC = '/home/moores/projects/pipeline/files/variation/cosmic-v67_20131024-GRCh37.vcf'\n",
      "REFERENCE = '/home/moores/projects/pipeline/files/reference/GRCh37.fa'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Path to key for downloading data from CGHub.  You need to contact CGHub to get a key."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "KEY = '/home/moores/cghub.key'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Path to MuTect.  I was getting bugs with version 1.1.4 so make sure you use this version."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "MUTECT_JAR = '/usr/local/share/java/mutect/muTect-1.1.5.jar'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Path to SomaticIndelDetector jar. This is in a specific version of GATK, so make sure you are using version 2.2-2."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "SID_JAR = '/home/moores/projects/pipeline/progs/GenomeAnalysisTK-2.2-2/GenomeAnalysisTK.jar'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Path to cache directory.  This is important because the bam files can pile up in the cash and eat up all of the space on a hard drive very quickly. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "CACHE = '/home/moores/cache'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Number of processes you want to spawn with the variant calling.  I'm taking a quick and dirty pass here and just running a bunch of bash scripts simultaniously.  At some point we will probably switch to a scheduler, but we don't have one on our Annai VM currently and this seems to be working ok for now."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "NUM_PROCESSES = 16"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Directory to store the data on whatever machine you are running the scripts on (we use a VM)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "VM_DIRECTORY = '/home/moores/projects/recall'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Local directory to spit out the bash scripts."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "LOCAL_DIRECTORY = '/cellar/users/agross/scripts'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Form Mutation Calling Pipeline Script"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Some shortcuts to make reading the program easier."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mutect = 'java -Xmx32g -jar {}'.format(MUTECT_JAR)\n",
      "mutect_args = {'analysis_type': 'MuTect',\n",
      "               'reference_sequence': REFERENCE,\n",
      "               'dbsnp' : DBSNP,\n",
      "               'cosmic': COSMIC,\n",
      "               }\n",
      "\n",
      "somaticIndelDetector = 'java -Xmx32g -jar {}'.format(SID_JAR)\n",
      "indel_args = {'analysis_type': 'SomaticIndelDetector',\n",
      "              'reference_sequence': REFERENCE,\n",
      "              }\n",
      "gt_fuse = 'gtfuse -c {} --inactivity-timeout=2'.format(KEY)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "__get_regions__ function that is used to build the intervals to feed GATK."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regions = pd.read_csv('../Extra_Data/hgnc_regions.txt')\n",
      "regions = regions.dropna().set_index('HGNC symbol')\n",
      "regions = regions.sort('Chromosome Name').groupby(level=0).first()\n",
      "def get_region(g):\n",
      "    '''\n",
      "    Given a gene, return the chromosomal in a format that GATK likes.\n",
      "    '''\n",
      "    r = regions.ix[g]\n",
      "    return '{}:{}-{}'.format(r['Chromosome Name'], r['Gene Start (bp)'] - 200,\n",
      "                              r['Gene End (bp)'] + 200)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Big function to build the call to move the data around and actually run the variant calling programs on the VM.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def build_call(pat, genes, normal_code='NB', tumor_code='TP', directory='', out=''):\n",
      "    '''\n",
      "    genes is a list of genes: ['HRAS','TP53']\n",
      "    directory is like: /home/moores/projects/recall\n",
      "    out is a folder to output the data.\n",
      "    \n",
      "    One caveat to remember with the argslists is that the normal input file \n",
      "    needs to be fed in before the tumor input file.\n",
      "    '''\n",
      "    pat_ = pat.replace('-','_')\n",
      "    \n",
      "    normal_id = tcga.ix[pat, normal_code]['analysis_id']\n",
      "    normal_ext = tcga.ix[pat, normal_code]['filename']\n",
      "    tumor_id = tcga.ix[pat, tumor_code]['analysis_id']\n",
      "    tumor_ext = tcga.ix[pat, tumor_code]['filename']\n",
      "    \n",
      "    normal_folder = '{}/normal_{}'.format(directory, pat_)\n",
      "    init_normal = 'mkdir {}'.format(normal_folder) \n",
      "    mount_normal = '{} {}/{} {}'.format(gt_fuse, CGHUB, normal_id, normal_folder)\n",
      "    \n",
      "    tumor_folder = '{}/tumor_{}'.format(directory, pat_)\n",
      "    init_tumor = 'mkdir {}'.format(tumor_folder)\n",
      "    mount_tumor = '{} {}/{} {}'.format(gt_fuse, CGHUB, tumor_id, tumor_folder)\n",
      "    \n",
      "    normal_f = '{}/{}/{}'.format(normal_folder, normal_id, normal_ext)\n",
      "    tumor_f = '{}/{}/{}'.format(tumor_folder, tumor_id, tumor_ext)    \n",
      "    \n",
      "    '''Call Mutect'''\n",
      "    mutect_args['out'] = '{}/{}/{}_call_stats.txt'.format(directory, out, pat_)\n",
      "    mutect_args['performanceLog'] = '{}/{}/{}_mutect_log.txt'.format(directory, out, pat_)\n",
      "    mutect_args['coverage_file'] = '{}/{}/{}_overage.wig.txt'.format(directory, out, pat_)\n",
      "    \n",
      "    arglist  = [''] + [k + ' ' + v for k,v in mutect_args.iteritems()]\n",
      "    arglist += ['input_file:normal ' + normal_f, 'input_file:tumor ' + tumor_f] \n",
      "    arglist += ['intervals ' + get_region(g) for g in genes]\n",
      "    mutect_call = mutect + ' --'.join(arglist)\n",
      "    \n",
      "    '''Call SomaticIndelLocator'''\n",
      "    indel_args['out'] = '{}/{}/{}_indels.txt'.format(directory, out, pat_)\n",
      "    indel_args['performanceLog'] = '{}/{}/{}_indel_log.txt'.format(directory, out, pat_)\n",
      "    arglist  = [''] + [k + ' ' + v for k,v in indel_args.iteritems()]\n",
      "    arglist += ['input_file:normal ' + normal_f, 'input_file:tumor ' + tumor_f] \n",
      "    arglist += ['intervals ' + get_region(g) for g in genes]\n",
      "    indel_call = somaticIndelDetector + ' --'.join(arglist)\n",
      "    \n",
      "    \n",
      "    unmount_normal = 'fusermount -u {}'.format(normal_folder) \n",
      "    clear_normal_cache = 'rm -rf {}/{}*'.format(CACHE, normal_id)\n",
      "    clean_normal = 'rm -rf {}'.format(normal_folder)\n",
      "    \n",
      "    unmount_tumor = 'fusermount -u {}'.format(tumor_folder) \n",
      "    clear_tumor_cache = 'rm -rf {}/{}*'.format(CACHE, tumor_id)\n",
      "    clean_tumor = 'rm -rf {}'.format(tumor_folder)\n",
      "    \n",
      "    r = ' ; '.join([init_normal, mount_normal, \n",
      "                    init_tumor, mount_tumor, \n",
      "                    mutect_call, indel_call,\n",
      "                    unmount_normal, clear_tumor_cache, clean_normal, \n",
      "                    unmount_tumor, clear_tumor_cache, clean_tumor])\n",
      "    return r"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 15
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Prepare Parallel Call for Running on VM"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Generate script run the variant calls."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def generate_scripts(patients, genes, out):\n",
      "    '''\n",
      "    Parses calls out to one script for each process you want to run.\n",
      "    Then builds a driver script to run them all in parallel.\n",
      "    Saves the files to the output directory.  The threads are names \n",
      "    thread_n.sh and the driver is named driver.sh.\n",
      "    '''\n",
      "    script_dir = '{}/{}'.format(LOCAL_DIRECTORY, out)\n",
      "    if not os.path.isdir(script_dir):\n",
      "        os.makedirs(script_dir)\n",
      "    \n",
      "    for i in range(NUM_PROCESSES):\n",
      "        calls = ' ; '.join([build_call(pat, genes, directory=VM_DIRECTORY, out=out)\n",
      "                            for pat in patients[i::NUM_PROCESSES]])\n",
      "        f = open('{}/thread_{}.sh'.format(script_dir, i), 'wb')\n",
      "        f.write(calls)\n",
      "        f.close()\n",
      "        \n",
      "    super_script = 'mkdir {}/{}\\n'.format(VM_DIRECTORY, out)\n",
      "    threads = ['bash thread_{}.sh &'.format(i) for i in range(NUM_PROCESSES)]\n",
      "    super_script += '\\n'.join(threads)\n",
      "    \n",
      "    f = open('{}/driver.sh'.format(script_dir), 'wb')\n",
      "    f.write(super_script)\n",
      "    f.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Prepare patient lists for big variant calling jobs. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Read Manifest form CGHub"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tcga = pd.read_table(cghub_manifest, low_memory=False)\n",
      "tcga = tcga[tcga.study == 'TCGA']\n",
      "tcga = tcga[tcga.library_type =='WXS']\n",
      "tcga = tcga.sort('uploaded')\n",
      "tcga['patient'] = tcga['barcode'].map(lambda s: s[:12])\n",
      "tcga = tcga[tcga.state == 'Live']\n",
      "tcga = tcga.set_index(['patient', 'sample_type'])\n",
      "tcga = tcga.groupby(level=[0,1]).last()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Form tissue type lists"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pats = set(tcga.index.get_level_values(0))\n",
      "\n",
      "tumor = tcga.xs('TP', level='sample_type')\n",
      "tumor = set(tumor.index.get_level_values(0))\n",
      "\n",
      "normal = tcga.xs('NB', level='sample_type')\n",
      "normal = set(normal.index.get_level_values(0))\n",
      "\n",
      "normal_tissue = tcga.xs('NT', level='sample_type')\n",
      "normal_tissue = set(normal_tissue.index.get_level_values(0))\n",
      "\n",
      "metastatic = tcga.xs('TM', level='sample_type')\n",
      "metastatic = set(metastatic.index.get_level_values(0))\n",
      "\n",
      "blood = tcga.xs('TB', level='sample_type')\n",
      "blood = set(blood.index.get_level_values(0))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Compile lists of matched samples"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tn_blood  = [i for i in pats if i in tumor and i in normal]\n",
      "tn_tissue  = [i for i in pats if i in tumor and i in normal_tissue]\n",
      "met_norm = [i for i in pats if i in normal and i in metastatic]\n",
      "blood_tumor = [i for i in pats if i in normal_tissue and i in blood]\n",
      "\n",
      "len(pats), len(tn_blood), len(tn_tissue), len(met_norm), len(blood_tumor)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 19,
       "text": [
        "(8413, 6428, 1839, 302, 150)"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Generate Scripts for Pan-Cancer TP53 Targeted Variant Calling"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "* Here we are saving the scripts in the p53_all directory\n",
      "* We are running this for all patients with matched solid tumor / normal blood samples"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "generate_scripts(tn_blood, ['TP53'], 'p53_all2')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 20
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Generate Scripts for Targeted Analysis of HNSCC Genes"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "hnsc = tcga[tcga.disease == 'HNSC']\n",
      "hnsc = hnsc.ix[tn_blood]\n",
      "hnsc_pts = hnsc.index.get_level_values(0).unique()\n",
      "hnsc_pts.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 21,
       "text": [
        "(463,)"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "sos_signaling = ['GRB2', 'HRAS', 'IRS1', 'IRS2', 'KRAS', 'MAP2K1', 'MAP2K2', \n",
      "                 'MAPK1', 'MAPK3', 'NRAS', 'RAF1', 'SOS1', 'YWHAB']\n",
      "genes = sos_signaling + ['TP53', 'MUC5B', 'CASP8', 'ASPM', 'EGFR']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 22
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "generate_scripts(hnsc_pts, genes, 'HNSCC_validation')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 23
    }
   ],
   "metadata": {}
  }
 ]
}