{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Process PanCancer TP53 Mutation Calls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This script is still a little dirty... aka hard paths and the like. To generate the calls we ran Mutect on the TCGA BAM files from CGhub. In general our calls are pretty agreeable with the TCGA working group calls despite (possibly) being called on slightly different software and having no manual curation step. We are using this to pick up about 2000 patients for our pan-cancer analysis, while our calls might not be perfect, we expect false-negatives rather than false-positives. If anything this should water down the p-values of our associations. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "cd ../src/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/cellar/users/agross/TCGA_Code/TCGA/src\n" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "from Processing.Imports import *" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "PATH = '/cellar/data/TCGA/protected/exome_andy/p53_all/'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Read in Variant Calls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in all of the VCF files and pull out the somatic TP53 SNVs." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pts_snv = [f[:12] for f in os.listdir(PATH) if 'call_stats' in f and '~' not in f]\n", "vcf = pd.concat({p: pd.read_table(PATH + '{}_call_stats.txt'.format(p), skiprows=[0]) \n", " for p in pts_snv})\n", "vcf = vcf.reset_index()\n", "vcf = vcf.rename(columns={'level_0':'barcode'})\n", "vcf['barcode'] = vcf['barcode'].map(lambda s: s.replace('_','-'))\n", "pts_snv = map(lambda p: p.replace('_','-'), pts_snv)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "vcf = vcf[vcf.contig == 17]\n", "tab = vcf.groupby(['tumor_name','contig']).size().unstack()\n", "vcf = vcf[vcf.judgement == 'KEEP']" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in all of the indel calls. I have this funky try catch because some of the files have the VCF header but no actual calls, Pandas does not like this. For this reason I keep track of the patients with the pts_indels dict and the indel calls with the indels dict, and then put it all together." ] }, { "cell_type": "code", "collapsed": false, "input": [ "indels = {}\n", "pts_indels = {}\n", "for f in os.listdir(PATH):\n", " if not f.endswith('indels.txt'):\n", " continue\n", " pts_indels[f] = '-'.join(f.split('_')[:3])\n", " try:\n", " indels[pts_indels[f]] = pd.read_table(PATH + f, skiprows=102, header=None)\n", " except ValueError:\n", " pass\n", "pts_indels = pd.Series(pts_indels)\n", "indels = pd.concat(indels)\n", "indels.index.names = ['barcode','num']\n", "indels.columns = ['chromosome','pos','id','ref','alt','qual','filter','info',\n", " 'format','tumor','normal']\n", "indels = indels.reset_index(0)\n", "indels = indels[indels['info'] == 'SOMATIC']" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not all variant calling runs were sucessfull. Here we are only using patients with a sucessfull SomaticIndelDetector run and a sucessfull MuTect run." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pts = list(set(pts_indels).intersection(set(pts_snv)))\n", "len(pts), len(pts_indels), len(pts_snv)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "(5689, 5752, 5694)" ] } ], "prompt_number": 8 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Annotate the Variants" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* I have not found a good automated way to do this, and the best solution is to upload the variants to the [Oncotator web site](http://www.broadinstitute.org/oncotator/)\n", "* They do have a REST API but it annotates about 2 variants per second, which takes a while with this many samples\n", "* The solution that I have emerged on it to save the variants to a temporar file in the format that Oncotator likes, upload that file to the website, and read in the oncotator output back into the program. \n", "* Yes, this is far from elegant, but works for now." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def format_indels_for_oncotator(s):\n", " r = {}\n", " r['chr'] = s['chromosome']\n", " r['start'] = s['pos']\n", " r['end'] = int(s['pos']) + len(s['ref']) - 1\n", " r['reference_allele'] = s['ref']\n", " r['observed_allele'] = s['alt']\n", " return pd.Series(r)\n", "\n", "def format_snvs_for_oncotator(s):\n", " r = {}\n", " r['chr'] = s['contig']\n", " r['start'] = s['position']\n", " r['end'] = s['position']\n", " r['reference_allele'] = s['ref_allele']\n", " r['observed_allele'] = s['alt_allele']\n", " r = pd.Series(r)\n", " return r" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "onc_indel = indels.set_index('barcode').apply(format_indels_for_oncotator,1)\n", "onc_vcf = vcf.set_index('barcode').apply(format_snvs_for_oncotator, 1)\n", "onc = pd.concat([onc_indel, onc_vcf])\n", "onc = onc.reset_index('barcode')\n", "onc = onc[['chr', 'start', 'end', 'reference_allele', 'observed_allele']]\n", "onc.to_csv('tmp.txt', sep=' ', index=False, header=False)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yes this is a hard path. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "ONCOTATOR_FILE = '/cellar/users/agross/Downloads/oncotator_output (9).txt'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oncotator only outputs annotations for unique variants. So I have to create a lookup for each mutation and then map it across the whole dataset. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "tab = pd.read_table(ONCOTATOR_FILE, skiprows=[0])\n", "tt = tab[['Chromosome','Start_position','End_position','Reference_Allele', 'Tumor_Seq_Allele1']]\n", "tt = pd.Series({i: tuple(v) for i,v in tt.iterrows()})" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "onc_indel = indels.set_index('barcode').apply(format_indels_for_oncotator,1)\n", "onc_vcf = vcf.set_index('barcode').apply(format_snvs_for_oncotator, 1)\n", "onc = pd.concat([onc_indel, onc_vcf])\n", "oo = onc[['chr','start','end','reference_allele','observed_allele']]\n", "oo = [tuple(v) for i,v in oo.iterrows()]\n", "maf = pd.DataFrame(tab.ix[tt[tt == s].index[0]] for s in oo)\n", "maf.index = onc.index\n", "maf = maf.set_index('Hugo_Symbol', append=True)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here I am turning the MAF file into a vector of patient TP53 mutation status." ] }, { "cell_type": "code", "collapsed": false, "input": [ "vc = maf.Variant_Classification.copy()\n", "vc.index = vc.index.droplevel(1)\n", "vc = (vc.isin(['Silent', 'Intron', \"3'UTR\", \"5'UTR\"])==False).groupby(level=0).sum()\n", "vc = vc.ix[pts].fillna(0) > 0\n", "vc.name = 'new'\n", "vc.value_counts()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "False 3348\n", "True 2341\n", "dtype: int64" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the MAF file and the TP53 mutation status vector for later analysis. This is not a proper MAF file as I append the patients with variant calls, but wildtype TP53 to the end of the file." ] }, { "cell_type": "code", "collapsed": false, "input": [ "wildtype = pd.Series({(p, 'wildtype'):nan for p in vc[vc==False].index})\n", "wildtype.index = pd.MultiIndex.from_tuples(wildtype.index)\n", "wildtype = pd.DataFrame({'Entrez_Gene_Id': wildtype})\n", "maf_wt = maf.append(wildtype)[maf.columns]\n", "maf_wt = maf_wt.dropna(axis=1, how='all')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "maf.to_csv('../Extra_Data/p53_calls_pancancer.maf')\n", "vc.to_csv('../Extra_Data/p53_calls_pancancer.csv')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Benchmark against TCGA Variant Calls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in variant calls from MAFS. This meta.csv file is generated in the [get_all_MAFs](../Analysis_Notebooks/get_all_MAFs.ipynb) analysis notebook." ] }, { "cell_type": "code", "collapsed": false, "input": [ "old = pd.read_csv('../Data/MAFs/meta.csv')\n", "old['Tumor_Sample_Barcode'] = map(lambda s: s[:12], old['Tumor_Sample_Barcode'])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "p53_old = old[old.Hugo_Symbol == 'TP53'].set_index('Tumor_Sample_Barcode')['0']\n", "p53_old = p53_old.ix[old.Tumor_Sample_Barcode.unique()].fillna(0)\n", "p53_old = p53_old.groupby(level=0).first()\n", "p53_old.name = 'old'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "pd.crosstab(vc>0, p53_old>0)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
oldFalseTrue
new
False 2103 90
True 79 1400
\n", "

2 rows \u00d7 2 columns

\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "old False True \n", "new \n", "False 2103 90\n", "True 79 1400\n", "\n", "[2 rows x 2 columns]" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get 94% sensitivity." ] }, { "cell_type": "code", "collapsed": false, "input": [ "1400. / (1400 + 90)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "0.9395973154362416" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get 96% specificity." ] }, { "cell_type": "code", "collapsed": false, "input": [ "2103. / (2103 + 79)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "0.963794683776352" ] } ], "prompt_number": 21 } ], "metadata": {} } ] }