{ "metadata": { "name": "", "signature": "sha256:105ea4bc1c4825d3c5e5b4c86f5782892f589165057981adade1b5894716f328" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Creating a VCF containing artifical variants for CFTR\n", "For some types of analysis, using empirical samples won't provide enough variation to discover all the edge cases. So we want to create a file that contains a huge number of different variant types.\n", "\n", "##Prequisite Libraries\n", "For this bit of data munging, we'll be using a few different Python libraries\n", "* [iPython](http://ipython.org/)\n", "* [Pandas](http://pandas.pydata.org/)\n", "* [HTSeq](http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html)\n", "* [PyTables](http://www.pytables.org/)\n", "\n", "##Desired format\n", "VCF would be nice in the end, so we will construct our dataframe to closely model the VCF spec.\n", "\n", "For each positon we want to representent:\n", "\n", "1. All possible SNPs\n", "2. All possible 1 base insertions following that position\n", "3. Two possible 2 base insertions following that position\n", "4. Two possible 3 base insertions following that position\n", "5. The deletion of 1 bases following that position\n", "6. The deletion of 2 bases following that position\n", "7. The deletion of 3 bases following that position\n", "\n", "\n", "\n", "\t\n", "\t\t\n", "\t\n", "\t\n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", " \n", " \n", "\t\n", "\n", "
#CHROMPOSIDREFALTQUALFILTERINFO
7117144307.CG.PASSDP=100
7117144307.CT.PASSDP=100
7117144307.CA.PASSDP=100
7117144307.CCA.PASSDP=100
7117144307.CCG.PASSDP=100
7117144307.CCC.PASSDP=100
7117144307.CCT.PASSDP=100
7117144307.CCAG.PASSDP=100
7117144307.CCTG.PASSDP=100
7117144307.CCCAT.PASSDP=100
7117144307.CCAGT.PASSDP=100
7117144307.CTC.PASSDP=100
7117144307.CTGC.PASSDP=100
7117144307.CTGGC.PASSDP=100
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Import our libs up front\n", "import itertools\n", "import random\n", "import os\n", "import datetime\n", "import pandas as pd\n", "import numpy as np\n", "from HTSeq import FastaReader" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Import Data\n", "First, we just need to get the table from Genome Browse into python. Since we we've already imported panadas, we can use it's parsing tools. But, we won't really use pandas the way it is intended to be used until later" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import_df = pd.read_csv(\"./EnsamblCFTR.tsv\", sep=\"\\t\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "tx_exon_dict = dict()\n", "# in general this is poor practice with pandas dataframes, but this df is very small\n", "# and a pure python data structure is more convienent\n", "for i in range(len(import_df)):\n", " exon_start_stop = zip(map(int, import_df[\"Exon Starts\"][i].split(\",\")), map(int, import_df[\"Exon Stops\"][i].split(\",\")))\n", " tx_exon_dict[import_df[\"Transcript Name\"][i]] = exon_start_stop" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Now we have our the ranges of our exons. Yay!\n", "Now we want to expand the start and stop by 100bp to make sure that we capture any variation caused by mutations in the immeadiate surrounding area.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "expanded_tx_exon_dict = dict()\n", "for tx_name, exon_coordinate_list in tx_exon_dict.items():\n", " new_coordinate_list = []\n", " for start, stop in exon_coordinate_list:\n", " new_start = start - 100\n", " new_stop = stop + 100\n", " new_coordinate_list.append((new_start, new_stop))\n", " expanded_tx_exon_dict[tx_name] = new_coordinate_list" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concatanate the intervals\n", "We really don't care about the transcript names. So the first thing we'll do is collapse our dict into an interval list." ] }, { "cell_type": "code", "collapsed": false, "input": [ "interval_list = []\n", "[interval_list.extend(x) for x in expanded_tx_exon_dict.itervalues()]\n", "print(interval_list)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[(117119916, 117120301), (117144206, 117144517), (117148987, 117149296), (117170852, 117171268), (117174229, 117174519), (117175201, 117175565), (117176501, 117176827), (117180053, 117180500), (117181969, 117182262), (117199417, 117199809), (117227692, 117227987), (117230306, 117230593), (117231887, 117232811), (117234883, 117235212), (117242779, 117243017), (117243485, 117243936), (117246627, 117246907), (117250472, 117250823), (117251534, 117251962), (117254566, 117254867), (117267475, 117267924), (117282391, 117282747), (117292795, 117293085), (117304641, 117305014), (117305412, 117305718), (117306861, 117308819), (117329666, 117330075), (117355711, 117356125), (117292796, 117293085), (117304641, 117305014), (117305412, 117305718), (117355711, 117356102), (117349970, 117350855), (117354124, 117354419), (117355711, 117356027), (117251570, 117251962), (117254566, 117254867), (117267475, 117268064), (117105737, 117105985), (117115644, 117115962), (117144206, 117144462), (117119916, 117120301), (117144206, 117144517), (117148987, 117149296), (117170852, 117171268), (117174229, 117174519), (117175201, 117175565), (117176501, 117176827), (117180053, 117180500), (117181969, 117182262), (117188594, 117188977), (117199417, 117199809), (117227692, 117227987), (117230306, 117230593), (117231887, 117232811), (117234883, 117235212), (117242779, 117243017), (117243485, 117243936), (117246627, 117246907), (117250472, 117250823), (117251534, 117251962), (117254566, 117254867), (117267475, 117267924), (117282391, 117282747), (117292795, 117293085), (117304641, 117305014), (117305412, 117305718), (117306861, 117308815), (117199591, 117199809), (117204622, 117204941), (117227692, 117227903), (117350085, 117350484), (117350564, 117350855), (117352294, 117352558), (117354124, 117354419), (117355711, 117355981), (117119257, 117119499), (117119415, 117119848), (117144206, 117144517), (117148987, 117149296), (117170852, 117171132), (117120048, 117120301), (117144206, 117144517), (117148987, 117149296), (117170852, 117171268), (117175201, 117175565), (117176501, 117176827), (117180053, 117180500), (117181969, 117182262), (117188594, 117188977), (117199417, 117199809), (117227692, 117227987), (117230306, 117230593), (117231887, 117232811), (117234883, 117235212), (117242779, 117243017), (117243485, 117243936), (117246627, 117246907), (117250472, 117250823), (117251534, 117251962), (117254566, 117254867), (117267475, 117267924), (117282391, 117282747), (117292795, 117293085), (117304641, 117305014), (117305412, 117305718), (117306861, 117307225)]\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collapse overlapping intervals\n", "Up to this point, we haven't done anything too interesting. Well put on your algorithm helmet and get ready!\n", "\n", "What we want it the minimum set of intervals that contain all the intervals in our raw list." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# sort by the start position\n", "sorted_by_start = sorted(interval_list, key=lambda x: x[0])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "def collapse_intervals(intervals):\n", " cur_start, cur_stop = intervals[0]\n", " for next_start, next_stop in intervals[1:]:\n", " if cur_stop < next_start:\n", " yield cur_start, cur_stop\n", " cur_start, cur_stop = next_start, next_stop\n", " else:\n", " cur_stop = next_stop\n", " yield cur_start, cur_stop" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Tests are appreciated\n", "Even with something pretty straight forward, it's good to make sure that the function works as you'd expect" ] }, { "cell_type": "code", "collapsed": false, "input": [ "test_intervals = [(1,4), (2,5), (5,7), (10,13)] #min set [(1,7), (10,13)]\n", "test_intervals2 = [(1,4), (2,5), (6,7), (10,13), (12,15)] #min set [(1,5), (6,7), (10,15)]\n", "print(list(collapse_intervals(test_intervals)) == [(1,7), (10,13)])\n", "print(list(collapse_intervals(test_intervals2)) == [(1,5), (6,7), (10,15)])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "True\n", "True\n" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "collapsed_interval_list = list(collapse_intervals(sorted_by_start))\n", "print(\"Before collapsing we have: \" + str(len(interval_list)) + \" intervals\")\n", "print(\"After collapsing we have: \" + str(len(collapsed_interval_list)) + \" intervals\")\n", "print(\"\\nOur new interval set:\\n\")\n", "print(collapsed_interval_list)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Before collapsing we have: 107 intervals\n", "After collapsing we have: 37 intervals\n", "\n", "Our new interval set:\n", "\n", "[(117105737, 117105985), (117115644, 117115962), (117119257, 117119848), (117119916, 117120301), (117144206, 117144517), (117148987, 117149296), (117170852, 117171268), (117174229, 117174519), (117175201, 117175565), (117176501, 117176827), (117180053, 117180500), (117181969, 117182262), (117188594, 117188977), (117199417, 117199809), (117204622, 117204941), (117227692, 117227987), (117230306, 117230593), (117231887, 117232811), (117234883, 117235212), (117242779, 117243017), (117243485, 117243936), (117246627, 117246907), (117250472, 117250823), (117251534, 117251962), (117254566, 117254867), (117267475, 117267924), (117282391, 117282747), (117292795, 117293085), (117304641, 117305014), (117305412, 117305718), (117306861, 117307225), (117329666, 117330075), (117349970, 117350484), (117350564, 117350855), (117352294, 117352558), (117354124, 117354419), (117355711, 117355981)]\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Variants\n", "We need to have quick access to possible variants. We will be createing all SNPs and 1bp insertions. To keep the dataset manageable we'll only use 2 of the 2bp and 2 of 3bp options" ] }, { "cell_type": "code", "collapsed": false, "input": [ "BASES = ['A', 'G', 'C', 'T']\n", "TWOBASES = [ ''.join(combo) for combo in itertools.permutations(BASES, 2)]\n", "THREEBASES = [ ''.join(combo) for combo in itertools.permutations(BASES, 3)]\n", "print BASES\n", "print TWOBASES\n", "print THREEBASES" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['A', 'G', 'C', 'T']\n", "['AG', 'AC', 'AT', 'GA', 'GC', 'GT', 'CA', 'CG', 'CT', 'TA', 'TG', 'TC']\n", "['AGC', 'AGT', 'ACG', 'ACT', 'ATG', 'ATC', 'GAC', 'GAT', 'GCA', 'GCT', 'GTA', 'GTC', 'CAG', 'CAT', 'CGA', 'CGT', 'CTA', 'CTG', 'TAG', 'TAC', 'TGA', 'TGC', 'TCA', 'TCG']\n" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Creating rows\n", "To keep things in organized we'll create a function that returns a list of the variants we want to seed at that position. We'll create all the SNPs and the insertions following the current base we are working on.\n", "\n", "The fields we want in each row are:\n", "\n", "* \\#CHROM\n", "* POS\n", "* ID\n", "* REF\n", "* ALT\n", "* QUAL\n", "* FILTER\n", "* INFO\n", "* FORMAT\n", "* SAMPLE\n", "\n", "Since ID, QUAL, FILTER, and INFO don't really exist as these variants didn't come from real sequence data, we'll just allow them to be missing and set the missing value on export. We can hard code in our FORMAT and SAMPLE column to contain the genotype information for this sample.\n", "\n", "In order to get the reference allele at each position, we downloaded the hg19 chr7 fasta file from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr7.fa.gz. You could easily do some simple file and string operations to get the reference from the position out of this file, but I chose to use the HTSeq library, so that I can pass the gzip file in directly and not deal with iterating over lines." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#treat the fasta reader like an iterator and get the first item\n", "chr7_seq = FastaReader(\"./chr7.fa.gz\").__iter__().next()\n", "\n", "def get_variants(chr, pos):\n", " variant_list = []\n", " #subtract 1 because python lists are 0-indexed; add 3 b/c python list slicing is exclusive of upper index\n", " ref = chr7_seq[pos-1:pos+3].seq.upper()\n", " variant_list.extend(get_snps(chr, pos, ref))\n", " variant_list.extend(get_insertions(chr, pos, ref))\n", " variant_list.extend(get_deletions(chr, pos, ref))\n", " return variant_list\n", "\n", "def get_snps(chr, pos, ref):\n", " snp_list = []\n", " ref = ref[0]\n", " for base in BASES:\n", " if base == ref: continue\n", " snp_list.append({'#CHROM': chr, 'POS': pos, 'REF': ref, 'ALT': base})\n", " return snp_list\n", "\n", "def get_insertions(chr, pos, ref):\n", " ins_list = []\n", " ref = ref[0]\n", " for base in itertools.chain(BASES, random.sample(TWOBASES, 2), random.sample(THREEBASES, 2)):\n", " ins_list.append({'#CHROM': chr, 'POS': pos,'REF': ref[0], 'ALT': ref[0] + base})\n", " return ins_list\n", "\n", "def get_deletions(chr, pos, ref):\n", " del_list = []\n", " del_list.append({'#CHROM': chr, 'POS': pos, 'REF': ref[0:2], 'ALT': ref[0]})\n", " del_list.append({'#CHROM': chr, 'POS': pos, 'REF': ref[0:3], 'ALT': ref[0]})\n", " del_list.append({'#CHROM': chr, 'POS': pos, 'REF': ref, 'ALT': ref[0]})\n", " return del_list\n", " " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "col_names= ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER','INFO','FORMAT','SAMPLE']\n", "variant_df = pd.DataFrame(columns=col_names)\n", "print variant_df\n", "for interval in collapsed_interval_list:\n", " pos, stop = interval\n", " while(pos < stop):\n", " var_dict = get_variants(7, pos)\n", " variant_df = variant_df.append(var_dict)\n", " pos += 1\n", "#hard code in our genotype information\n", "variant_df['FORMAT'] = 'GT'\n", "variant_df['SAMPLE'] = '0/1'" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Empty DataFrame\n", "Columns: [#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, SAMPLE]\n", "Index: []\n" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Let's check out the table" ] }, { "cell_type": "code", "collapsed": false, "input": [ "variant_df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
#CHROMPOSIDREFALTQUALFILTERINFOFORMATSAMPLE
0 7 117105737NaN C ANaNNaNNaN GT 0/1
1 7 117105737NaN C GNaNNaNNaN GT 0/1
2 7 117105737NaN C TNaNNaNNaN GT 0/1
3 7 117105737NaN C CANaNNaNNaN GT 0/1
4 7 117105737NaN C CGNaNNaNNaN GT 0/1
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ " #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE\n", "0 7 117105737 NaN C A NaN NaN NaN GT 0/1\n", "1 7 117105737 NaN C G NaN NaN NaN GT 0/1\n", "2 7 117105737 NaN C T NaN NaN NaN GT 0/1\n", "3 7 117105737 NaN C CA NaN NaN NaN GT 0/1\n", "4 7 117105737 NaN C CG NaN NaN NaN GT 0/1" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Create VCF\n", "Now we'd like to create a VCF from our dataframe. Since this is a one off vcf file, we'll just manually construct a header and then append our column headers and dataframe values." ] }, { "cell_type": "code", "collapsed": false, "input": [ "vcf_header = \"\"\"##fileformat=VCFv4.1\n", "##fileDate={0}\n", "##source=CFTR_Variant_Project\n", "##FORMAT=\n", "\"\"\"\n", "vcf_header = vcf_header.format(datetime.datetime.utcnow().strftime(\"%Y%m%d\"))\n", "'\\n'.join([vcf_header, ('\\t'.join(variant_df.columns)), '\\t'.join(col_names)])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "'##fileformat=VCFv4.1\\n##fileDate=20140501\\n##source=CFTR_Variant_Project\\n##FORMAT=\\n\\n#CHROM\\tPOS\\tID\\tREF\\tALT\\tQUAL\\tFILTER\\tINFO\\tFORMAT\\tSAMPLE\\n#CHROM\\tPOS\\tID\\tREF\\tALT\\tQUAL\\tFILTER\\tINFO\\tFORMAT\\tSAMPLE'" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "output_file = \"CFTR_artifical_variant.vcf\"\n", "try:\n", " os.remove(output_file)\n", "except OSError:\n", " pass\n", "with open(output_file, 'a') as f:\n", " f.write(vcf_header)\n", " variant_df.to_csv(f, sep='\\t', na_rep='.', index=False)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Bonus points!\n", "We might want to come back and play with our dataframe without having to regenerate all this data. HDF5 is a great way to store Pandas dataframes." ] }, { "cell_type": "code", "collapsed": false, "input": [ "store_file=\"store.h5\"\n", "try:\n", " os.remove(store_file)\n", "except OSError:\n", " pass\n", "store = pd.HDFStore(store_file)\n", "store['variant_df'] = variant_df" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#What does this vcf look like?\n", "If we import this vcf into [Genome Browse](http://genomebrowse.com), we can see our worke. First, notice in the zoomed out view that our variants have covered all of the exonic regions in all the the transcripts in the Ensembl database for CFTR. Second, upon zooming in we can see our deletions of various sizes in pink, our SNPs in the midle multi-color blocks, and our insertions in the small orange bars in the lower part of the view.\n", "\n", "![Zoomed Out](vcf.png)\n", "![Exon 14](vcf2.png)" ] } ], "metadata": {} } ] }