{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "##Rosalind: Counting DNA Nucleotides##" ] }, { "cell_type": "code", "collapsed": false, "input": [ "dna_string = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use a [dictionary](http://docs.python.org/2/library/stdtypes.html#mapping-types-dict) to keep track of nucleotide counts. Using dictionaries effectively is one of the keys to getting the most out of Python." ] }, { "cell_type": "code", "collapsed": false, "input": [ "counts = {'A': 0,\n", " 'C': 0,\n", " 'G': 0,\n", " 'T': 0,\n", " }" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Iterating over a string produces the characters of the string one at a time." ] }, { "cell_type": "code", "collapsed": false, "input": [ "for c in dna_string:\n", " counts[c] += 1" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "List comprehensions are a powerful way to construct lists by peforming some operation on each of the elements in an object that can be iterated over." ] }, { "cell_type": "code", "collapsed": false, "input": [ "count_strings = [str(counts[base]) for base in 'ACGT']" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "print ' '.join(count_strings)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "20 12 17 21\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Rosalind: Transcribing DNA into RNA##" ] }, { "cell_type": "code", "collapsed": false, "input": [ "dna_string = 'GATGGAACTTGACTACGTAAAT'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can't directly change the characters in the string - strings are immutable. \n", "(If you aren't familiar with the `enumerate` built-in function, check out it's documentation.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for i, c in enumerate(dna_string):\n", " if c == 'T':\n", " dna_string[i] = 'U'" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "'str' object does not support item assignment", "output_type": "pyerr", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mi\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mc\u001b[0m \u001b[1;32min\u001b[0m \u001b[0menumerate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdna_string\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mc\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;34m'T'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mdna_string\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'U'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mTypeError\u001b[0m: 'str' object does not support item assignment" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could build the string up incrementally ... " ] }, { "cell_type": "code", "collapsed": false, "input": [ "rna_string = ''\n", "for c in dna_string:\n", " if c == 'T':\n", " rna_string += 'U'\n", " else:\n", " rna_string += c\n", " \n", "print rna_string" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "GAUGGAACUUGACUACGUAAAU\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "... but string objects have a `replace()` method that does exactly what we want." ] }, { "cell_type": "code", "collapsed": false, "input": [ "rna_string_2 = dna_string.replace('T', 'U')\n", "print rna_string == rna_string_2" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "True\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Rosalind: Complementing a Strand of DNA##" ] }, { "cell_type": "code", "collapsed": false, "input": [ "dna_string = 'AAAACCCGGT'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "One approach is to use a dictionary to encode how each nucleotide gets complemented and to use the built-in function reversed() to iterate over the DNA string in reversed order." ] }, { "cell_type": "code", "collapsed": false, "input": [ "complement = {'T': 'A',\n", " 'C': 'G',\n", " 'A': 'T',\n", " 'G': 'C',\n", " }\n", "\n", "rc_string = ''\n", "for c in reversed(dna_string):\n", " rc_string += complement[c]\n", " \n", "print rc_string" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "ACCGGGTTTT\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out that there are built-in functions in the `string` module that do exactly what we want." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import string" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "complement_table = string.maketrans('TCAG', 'AGTC')\n", "c_string = dna_string.translate(complement_table)\n", "rc_string_2 = c_string[::-1]\n", "\n", "print rc_string == rc_string_2" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "True\n" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why does [::-1] work?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "up_to_ten = range(10)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Taking a slice a:b returns the interval of the list starting at index a and going up to but not including index b. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "up_to_ten[2:4]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "[2, 3]" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Leaving off part of the slice means 'start from the beginning'..." ] }, { "cell_type": "code", "collapsed": false, "input": [ "up_to_ten[:4]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "[0, 1, 2, 3]" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "... or 'go to the end.'" ] }, { "cell_type": "code", "collapsed": false, "input": [ "up_to_ten[4:]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "[4, 5, 6, 7, 8, 9]" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "An optional third slice argument gives a stride length to step along the interval with." ] }, { "cell_type": "code", "collapsed": false, "input": [ "up_to_ten[2:9:3]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "[2, 5, 8]" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stride length can be negative...." ] }, { "cell_type": "code", "collapsed": false, "input": [ "up_to_ten[9:2:-2]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "[9, 7, 5, 3]" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "... so this idiom means 'start at the end and go through to the beginning stepping backwards one index at a time', which is a complicated way to say 'reverse'." ] }, { "cell_type": "code", "collapsed": false, "input": [ "up_to_ten[::-1]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Rosalind: Counting Point Mutations##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`zip()` is useful when you have several lists that you want to process element-wise together. \n", "\n", "(Questions: Why do you think this is called zip? What built-in function that we have seen is equivalent to zip(range(len(seq)), seq)?)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "numbers = range(4)\n", "letters = 'ABCD'\n", "\n", "for n, l in zip(numbers, letters):\n", " print n, l" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0 A\n", "1 B\n", "2 C\n", "3 D\n" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets use `zip` in a filtered list comprehension to produce a compact, simple Hamming distance function." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def hamming_distance(first, second):\n", " ''' Returns the number of positions that differ between first and second. '''\n", " distance = sum(1 for f, s in zip(first, second) if f != s)\n", " return distance" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "file_name = 'data/rosalind_hamm.txt'\n", "fh = open(file_name)\n", "first = fh.readline().strip()\n", "second = fh.readline().strip()\n", "\n", "print hamming_distance(first, second)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "482\n" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Rosalind: Enumerating Gene Orders##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `itertools` module is full of all kinds of cool stuff." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import itertools" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "how_many = 3\n", "perms = itertools.permutations(range(1, how_many + 1))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, it doesn't make sense to ask for the length of an iterator." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print len(perms)" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "object of type 'itertools.permutations' has no len()", "output_type": "pyerr", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mprint\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mperms\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mTypeError\u001b[0m: object of type 'itertools.permutations' has no len()" ] } ], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "For iterators we know to be finite, we can 'cast' to a list." ] }, { "cell_type": "code", "collapsed": false, "input": [ "list_perms = list(perms)\n", "print len(list_perms)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "6\n" ] } ], "prompt_number": 27 }, { "cell_type": "markdown", "metadata": {}, "source": [ "To produce the desired output format, we need to convert lists (actually, tuples) of numbers to lists of strings.\n", "Let's use a list comprehension." ] }, { "cell_type": "code", "collapsed": false, "input": [ "for perm in list_perms:\n", " strings = [str(value) for value in perm]\n", " print ' '.join(strings)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 2 3\n", "1 3 2\n", "2 1 3\n", "2 3 1\n", "3 1 2\n", "3 2 1\n" ] } ], "prompt_number": 28 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Rosalind: Finding a Motif in DNA##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to find all occurrences of substring in string_to_search is to check if the substring is a prefix of each suffix of string_to_search." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def find_locations(substring, string_to_search):\n", " ''' Returns a list of all positions in string where substring occurs. '''\n", " locations = []\n", " for i in range(len(string_to_search)):\n", " if string_to_search[i:i + len(substring)] == substring:\n", " locations.append(i)\n", " return locations" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "file_name = 'data/rosalind_subs.txt'\n", "fh = open(file_name)\n", "string_to_search = fh.readline().strip()\n", "substring = fh.readline().strip()\n", "\n", "one_based = [str(l + 1) for l in find_locations(substring, string_to_search)]\n", "print ' '.join(one_based)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "2 63 74 92 99 126 143 158 166 174 199 253 315 332 339 363 403 410 426 433 500 526 535 542 582 589 661 733 752 768 775 808 815 822 872 950\n" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you are searching for matches to a pattern in a string, you should think 'regular expressions'. (Of course, [there is this](http://regex.info/blog/2006-09-15/247).)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 31 }, { "cell_type": "markdown", "metadata": {}, "source": [ "`re.finditer(pattern, string_to_search)` will produce (an iterator over) objects representing matches of pattern in string_to_search. These match objects have a `start()` method that gives the location of the match." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pattern = substring\n", "for match in re.finditer(pattern, string_to_search):\n", " print match.start()," ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 62 73 91 125 142 157 173 198 252 314 331 362 402 425 499 525 534 581 660 732 751 767 807 821 871 949\n" ] } ], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A gotcha here is that overlapping pattern matches will not be found. (By default, the re engine 'consumes' the string as it moves through, so regions that are part of one match never get looked at again to be part of a potential second match.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "matches = re.finditer(pattern, string_to_search)\n", "re_one_based = [str(m.start() + 1) for m in matches]\n", "print re_one_based == one_based" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "False\n" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": {}, "source": [ " Adding a (cryptic to the uninitiated) lookahead flag to the re pattern changes this behavior." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pattern = '(?={0})'.format(substring)\n", "matches = re.finditer(pattern, string_to_search)\n", "re_one_based_2 = [str(m.start() + 1) for m in matches]\n", "print re_one_based_2 == one_based" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "True\n" ] } ], "prompt_number": 34 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Rosalind: Protein Translation##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the strengths of Python is the existence of high-quality modules for solving common problems in different domains, such as Biopython." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from Bio.Seq import Seq" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 35 }, { "cell_type": "code", "collapsed": false, "input": [ "file_name = 'data/rosalind_prot.txt'\n", "fh = open(file_name)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 36 }, { "cell_type": "code", "collapsed": false, "input": [ "dna_string = fh.readline().strip()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 37 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a Seq object out of dna_string." ] }, { "cell_type": "code", "collapsed": false, "input": [ "dna_seq = Seq(dna_string)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 38 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Translate to amino acid sequence, with Biopython taking care of knowing the genetic code." ] }, { "cell_type": "code", "collapsed": false, "input": [ "aa_seq = dna_seq.translate(to_stop=True)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 39 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert from a Seq object to a string." ] }, { "cell_type": "code", "collapsed": false, "input": [ "aa_string = str(aa_seq)\n", "print aa_string" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "MGMREIQTLADLSSVLGIRYRVVRRRSCQRVIKTWVSYVIRTFHPHRVGLAGLEGRSLGTHMVGANGVMLTVTTLTSLGAIAAISSRPSQSIRGSVFSPRTIYGIQAYQHIHSPAYGKTWPSRERNSRTSSGQNCAAMLLELQHPRLVTCRSRNMFLGTACLPLSGPDSVSIEIMALVEFGGVSWKRIALSVYSRMGAGSTWALTQITKHIDTGTRLAGAHCLCWARGYKFVSGPTPPANWTLEYHSPGTSYNSTSCKRGMITRRMSWHLFGSLPNLSCKIHCCSRCDHWERSIRATCFRWSQCKAQECSDIDVDHICFAARGSRKTSLASWIVGYQWVVVLMETGALKAGLRPRAMRIQSPSSVSSFGTRVAHFRSRGPIGQPSSAESMIVAYWTIAYKLCRAPYLVVERLSNWSMTVYTFRLYKTNHGIPSNIKTPSRTARAFPTWKIPTDPVTAARFGGSRCNTKPKSPRVFEADIDRCSYSCDLKAQYQLSPDEITCLLSLVISVKHPRRRLLQSAFYCPMVGLGRSLKRGCQPFTACNDWTYRPSVKGSLACCAVLLQVSSVGRFIFPKRPKTIFLGERWPRARIISCRTIKAMKQYILIRRPTVAPEVCLRTSDPMHPLRLQPRGSCEYMLYSHLSRSRKARHAIETPRAPITKLLGSRTLPPQRVEVPIDSLVPLSPLQYGRQLWVFRATVWRGFLKTRLSQDLIVQSKMQNFYMLMSLQSTPEKNRPNPKNIRKFRTQCTSLLYCRNQHLRISFPRDGLVAVRGVCYLNRPRSLKVHRYRANIVRFKQFGVITVRYPSVTTGGGEKWSPDGRYMITRVNHRQMRARLPLNDRAHLLYPLGAELIFWYVGAARIHQMADDRNHMASTGLGTNCVREALAILARGRCMPRSHATFTSVRLHLTLTVPESHLCWKELLRAFSLATRNLQLVASIKPSTKHSHTINNLRRWVWLTRWMESYLPLYLVLRRCRVTLLLLLILTLSGQQIADGSDRGNHFTTLLHLYLDTKNHGKVPAVEPSSIIPLHSKPTHSGLLMSQYHRVRSCPVISRYAYCGRQPRVHNVIARNVFRPKVSEASYVKNTSTLRHSEYSNSCGLSALIDQNIFEAIYEGNVNTCENARSVKRRVISVILPYRLLMYICRAEAATPFLLRVNTRPDGDGANININKFVATSNIEAPDYKVNDKAATVGDAEVAIGCLELWSASTFCAVIGGTQCAILLMISDCALHISVRVPLSRAEHRCIGVTGGAHVDLPFARGIRHAGWHHARVDRGYDGVTLHFRQIIRKDFGSAGIIRVDGSHDFVSKDSYTNGVYSKNFTSTDPLNEITYDTERCYRVSITGLRLTYQCRFGSQVQMQTPLTNATIELYSRVNEPRTYDRFSRKLASNYIGKLVASWPNTAVGASRNTRRLGRTVASWGSESVEQTVLSCLLSLSVLQLIESARKPSVPKCYELRALSPMTCALHTSVITATVLDVNPQRVKVWLPGFLQSDARYDVVGPPFVKRSDGKSCPLRGPYPPAISDLSPWPLHPEPLFGTRTGTQPSMYSKRVFLLPTLRVGTSQADVTRMTAYCTLRLAHPTFTGASEQVQDTIPVIRSNGAHFLKAPYPLGHKFRCTSTASAGWHVDLITASGSFTTYVRLNGSTLTIVWRSPWSTHFDKASFSTVVLLRCSFSYCGSPRGANRGLALHQRYLLPWRQGAKKRILLFVRQLCTAMNFSFIETVRLEPNYSKHATVSTAGPARTCCYVFRANAKTGLVELCSPIAYPENLIGVLELKCFRHSPSSARACQPLPVAWRPHNITEPQTIYGYTQFANFSRYSPEYSTFTGVMASLYSPGSAPIAMREEICQLALADAVGPRKSPWLGQLTPLHSYVTSTSMSTDGRWRGPRQSSLCAPGARASQPTTLWAYPSSKPLRRRQIALTRQRYKSGQTMACVLLCLAIFIGLRCNQPKINSISLVTFYVSSVILHYTHGPVLGSRFFHLSGYHAMRPRDACSIKMICGTGPSPCLRVANNTLAVVTHHIYILALDLPSSRSCLTTGRDLRATLVYVYAFPFPNGAFRGGWLLRFNPLLKPSRLYREGYIHESELGPMHASGSGGTSCPIEAGVCIDRIPQERLIQVRLLMLFVLAASLTLGAQNIAIQRARSVVYAIPLYVHKRLRLGSMLHDSTLALGYLKTRFSVFRGRRLAAASIPDLAQLTDDVRLLPHVKQSTCINDDVLQPKSKSVSGITKRIELGPNKKGRTPGVNTLKQLVGDTRGVLRKTRGLSAVSTHPTRSSNCGVYDEGKPGGTSNNLLYKTILATALLGQLVSIAYMCLLCRGFKLANITERTSGSTYCLPVIASTTPAFQPTGNRTLGSTKPTRKGELLGGMPCFCNYRPSSRSRVTTLIRVSYVCSCEINKVGTSPNLPPERLASSYFGSRSLVGLIGVAIVESSDCSRVRKLLFVLCKQPIKPLFDKTPAEPLRAVVTSLCYASIQELPPGLRLATNIRKNAILGPICIRSDHLRSGLVLQNQITRTLATILALTQWEFNPARWLPEPRSDGDDLAASPKAYDPQLCNFFDLRRRISRTCKELSNGCGSFSSIIVLAGLIVSKSYVKEQRSLSGTRIFLSFEAPAVCLMLQIRKKWINGDAMEREAPVTSCIKQKRIINAFSFAVLQQSVLSAPAPKCHHYNRCPFESSPFNAVSVSLGHSRTPYNPCSQQAGDVMTFLEKSCDQVVGIVVRSQYCIATNHVESRCGWKVPSEAWLLNKLLRSISGCTGNNARNTAWSMTRTEVHSWKWQYANGGLRGEGMKHFRQWGHLHMLGFVRDRRSEHYLVPNNPSTCDWPQSSSRAIGELHIFGVSAVRTSGLLFSPLCSGDSGSNLEGALNEYTDHAFKPVTVLTSYKGPRHTRSDRWLHDVITTAIRRRCQFTAELSPRFNAAPHMGRTCVFRCVLPTSNASLNPPMSTRHRSGFQSYNVKIRSSGTYSPGRVESPVCLWQIPLRALTNSPAPDIMFILGKQHLFALELVLRPTNEAKDRDRISYSSPAPALLVTSGGSPVSYNTWPIMNSRCSEFASTVRLTFYLSHRRSRETRVTSIIQVPKSSQCALTERTVDRTVGREAISVRQLLSHSATAVFLIKTANQAPLIRGQPSCFNGFDASHVHHWHTITRRPLQLLTGGNLYIKRHHHLWACKVSVDFWIGCPVPGAIPDVFPGGALGSEVRLGARRFVINPPRPGPLHNPQLNLGSRLDLYCGSRRVLLEKGDGRRTV\n" ] } ], "prompt_number": 40 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Rosalind: Computing GC Content##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parsing a FASTA file is a little messy. After each line that starts with a `>`, we need to read an indeterminant amount of lines until we encounter the next one that starts with a `>`.\n", "This function takes in a file name and return a list of tuples of (name, sequence)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def fasta_records(file_name):\n", " ''' Returns a list of (name, seq) pairs for fasta records in file_name. '''\n", " records = []\n", " fh = open(file_name)\n", " name = fh.readline().strip().lstrip('>')\n", " while name:\n", " seq = ''\n", " line = fh.readline().strip()\n", " while line and not line.startswith('>'):\n", " seq += line\n", " line = fh.readline().strip()\n", " records.append((name, seq))\n", " name = line.lstrip('>')\n", " return records" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 41 }, { "cell_type": "code", "collapsed": false, "input": [ "file_name = 'data/rosalind_gc.txt'\n", "records = fasta_records(file_name)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 42 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Argument unpacking makes code more readable." ] }, { "cell_type": "code", "collapsed": false, "input": [ "for name, seq in records:\n", " print name\n", " print seq\n", " print" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Rosalind_8336\n", "GCTCACCCGTGTCGCCCGAACATGCTTAACTAACTTGGACAGGTTTGAAATAGCACTTGAAGAAAACGTTTCAGCGATGACCAGACGAGGATACGATCGCTCCTCGGAAGCCCTCTAAGCTCGTCGCAGTGAACAAGACTCTAAGAGTCTTCAGACGCTTGGCGCCGGTAGGTGGTTCTGTGCACTACTCCACGTCGTCCGGTCGAGTAATAAGCAGGGATCTTTCGGCGTTGTACTAACAGCTCATAACTCCAAAGCCAAAAACTCAACTGGGAAAGCGTAGTGTCAGATGACTGTACACTCACACAAGTACGGTCGGCAGGGGCACCGTTACGCCCCAGACCGTTGCGTATTTCTGCAGTCCGGCGTAAGTAGATTGTCCGTGGGCTTACTCTTGGAATTATATCATGTCTCTGGCTTGGAATTGGCGTTACATTGTGGGGACCCGCTGCGGCCGTCCCTATTCATATCATTTTTGAGCATAATGATTTTTTACAACAAAAATGTAGCACGGAATGCCGAATTTAAGTCAGTTTACATCAGTGAAGCGGTTTGCCCAACGGCCAGACTGCGTGGGGCACAGTTGGCCTATGTAGGACCACTCGTGCGCATTAGAACTTCAACCCGTTGACTACGCCGGAGCGGGACAACGTAATTCATTGATCACTGGACCCGACAGTACGGGTGGACACAATGTGTCAATGTAGGCCAGCATAGTTAAACATCGAGCCCCGTGTGACAGAAGTTTGAAGCGATGTAGTGATGGTCTTCACTTGTCATGACGGTGGGGCCTCGGGAAAGTTGGCATCCCTACAGCGTTTCAGATGCGGTCCAATGCTAGCGGACAATTCTAGCACGTTCTAGGCTGCCGCTAGGTCAACCCGCCACATGATCCACTGAAAGT\n", "\n", "Rosalind_6236\n", "TATGCAAATTTATGGCGCGGCCATTGAGTAAGACTTCAGCAGCACCTGATGTTCGCATCTAAAACCTGTTAGGTACTGTATTGCAATCGAATATTGTGACCCGGTCCACACGCACGCATACTTGTGCTATGAACAATATCAACTACCCCACACAAGCGTTGGCGGTTGGGTACTCGAAACTATAAGATTCGCCCATGTGTGTAAAGGCGTTGTCATCTCACTGAAACCTTGCCTCGAGTCCTAGAGAAACTGTTTAGAATGAACCGCTATCGGCCAGCGATGCCCCTCCTCTGTCTGGCCGTAATTGGGGCCTAGACTCCGATGTATCTGATGTATGCACGGAAAAAGCTCCACGCAGGTTACGGAGGGTGCGCTAATAATGGTCACAGCGGGTCGATGGGGTTACGTTATACATTCCCAAGCAGTTCGACATCCCGCAGATCGAATACACACTGTCATGCAGCGTCACTGCCCACCTGCACCGCCGGTTAAACAAGAATTAGTTCCGGTCCTCCGCCTGGGTGATGGGTAACTTCGGCATGAGGTTACAACCACGAGATCGACAAATGCCTCACCGGAGGAGATTCGAACCGCGTGGGATGTTCGAATTGGCGCCCCGCATAGTTCCGTTCCAATATCATCTTTTGGCCACTGATCCCTGCGGGTTCTTGCCGGAGCCGGTTGAAATGTGGGCGCGACCTGGCAACATTTGAATGGCAGATCCAGACCACAACTGTAGGAAACCCTCCTATTCTAGTCTGGATTCAACTACCGTGCACCACAGTAGTTGGACTGAACGGACAGCTTCCGTCTGGGCCTTTCCTTCAAGGGTTGCGAATTAACGCCCTTCCCCTCTTCACTAGTCAGCGTTCGCGGACCCTAACGAAGACGAGAGATCATCATTGGTGTACTAGCGTC\n", "\n", "Rosalind_9301\n", "GATATGAGGTGACCTTTCGGGTATCGAACAGCGAAACAACGGTGGCATAAGCTCTCTCTGCCTTTGATATTTCCCAAGTGCGTGTCTGTGTGGATGAGCCTTCGACCAGTCTACCACTCAACAGTGAGTTAAAAGCCAAATTACAAATTTTGCGGTTTCCCTTCTTGGTGCGCATGCCTATCAAGCTGCCGTACTTTAAGAGACGGCAAACCTCACAGTACGCCAAGGGGTTATTCGTCTTGTCCGCGGATAGGATATGTCAGCGCGCGATAACGAGCTCATCCAAAGATTCGCGTAAGCATTGCCAGGTCGTAAGGGCTTAGCAGTCGAACTTTATTAGTATCCAAATCAGTCAAGGAGACGGATTCCCTTAACGCCTACCTGATAAACTACCTAGCCCTGTTACGCCTGCGAGAGATACCCGCGGCGAGAGTCGGATGCTCGATGCCGGTAGCCTGTCTAACGGTATACTAGTAAAATCGGGTGTCAGCAAATGTAGGTTACACACGCGTTTAAGTAACGCTGACAAAGAACCTGTCCGTGCTTCTTGCCAACACTATCTTCATCGTATTCTACTTAAGCGTGCACACGGGTCCTCTTAGACCTAGTTATTTGTGTGAAGTAGACCGCTGGGCGAGTAAGCCGGTTTATTCTCCGCAGACTTGGCCAACTCGTAGTAGATCTCACAGAATCCCATCAGGTATAACACAGGCAACTGGGCGGCGGCTCAAGAAGCCGGGATAACTTAGGGGGCTATTCGGAGCTGTCTTGGGGAGCTGGTATCGAGTAGCGAGTTAATAATACTTGCAAA\n", "\n", "Rosalind_6955\n", "TGACTTATCAAAGACCTAGGAAGTTTCAGGCCTCTCTTAGTATGCCACCCACGCTGGACGGCGCTGAATGCCTCAACCAGAACGACCCGTAAACAAAAACCGGAATCGGGCTGGGATGTTCTAGGCGGTAAGGTGCGATGCTGTAGCTTATTTCGCACACTCCCCTAATTCATAAGAAACGTAACGCCGTAGCCTAATTTACTCCGATCGAAACGCATGAGTCGTGCGTTTGGGGGAGCAAAGATCATCATTGAGGCGATGGGCCGCCAAGTAAGCATGGTGGGTGTGGGTCTTAGCGGCTGCTTGGGGCGTGGACGTGTCCGCGCTATACCCGAAATTTTGCGGCAAGCTCGGCCTGGGAAGTGAGCTCCAGCGAACGTAGTATCTGGTGCGGACAGACTTAGCAGTATCAAAAAACGTCGGTCCCATCTTACATAGGATCTTTAACTGTATGTGCCATCGCATGCGTAACTTCAGTCGCTCGGTTCCGACGGAGCGGACAGGTCCGTGAACGGTGACCCCATGCAAGCCAGAACTCCAGGGATCGCCGGCGGGCGTCTGAAGGAACAAACACCAATGTAACATGCTTAACCCGCGCGATGTTTAAAACTCTGCACCATGAGCTTAGATACTGGTGACCATTCCGAGCCGCCCTGATCCGAATCTCTTGTAAACTGTCGGAGGGTACTATTGGGGCATGTGTGGAGGACCGAGATGTACGAATGGATATAGCTTAACGCTGCTGCACGAAATCGACTGTTACGTCTCACCTGTTCGGCCTTAGCGTCAGACCTGAACCCAGAGGTGCGCGTTACATTGAGGACCGGTACGAAATTGGCTCTGATTATAAGTCATCGTAAGAGGATTACAAGCGCATCTGCGTGGCTCTCGTGCCGTTCGTTTTGGAACACCTCCGATTGGAGGTTATCAGTCAGCCCATCAATATCAGTGCTAACCAGCTCGTGGCGCCATGCCCGTTACCCGAAGATTCAACC\n", "\n", "Rosalind_0896\n", "GCTGCAAAACGTACTCAACAGAATAGTAACCAGCAATCCGGATTCGCATGCTAAGTAGTATCAGGAGCCACGATTATTGCCTGCGCTTGCGTATGGCAGCCAGTTGTGAAGCGGTGAAGGGATATGATAGTAAGAAATCTTTAGAACCAGAGTGCATGACATGTCCGACCCTTTTGGGCCCATTATCCGCGACGTCCAGTCGTACCGTACCGGCGGCCCCTTAATTCTAGCTAAAGTTAGTCTCATAGACGTACGGTCAACCATCTGTCGTTGTCGCAGCAGATTAATCACGCTTATTAAGATCCGCTTCTACATCCACGTCTCTAGTACCTGCTGCCGACGCATCCGCTAACTATGCTTCTTGTGCAGTCGTCTATGCGCCGAGCCCATTATCTAATCTCTGCATACTTGCGTGTCCACGGATTTCCGTATTCTAACCCTCGGATCTACGTTTGCCGGCTGTCGACCGATCTTTAGGCTCACAGTCATCCGACCTGATCTGGGTCTAGTGCCGGTTATTACAGTTGACAGGCCATGCGTTCCACGGTGGAGACTTATACGTTGGGTGGAATGCTTCTTTTAACTGTGCATTTGATCCCTGCGTCAGTCTTTTTCAGACCTTAAAACCGACCGTGCTTGACGGCGAGTCACCGTCCATATAATTGTAGGACCCCGCTCTAGACTCGTGAGGTCATAGTGTAGGCCAAAGAGGTTGTTGTGCTCGTCTGTACTACCAACAGCTCTAGAAGGACGCGTGCAGTCATGTGGAGATGGGGGGGCCGCACGTTGAAAGGACCGGCAAGTCCACGACCCAACAAGCGTTTCGATACCCAAGCAGGCAGATTCATGTTACAATCGTGCGGCTCAGTACCAATTGCATAGTGTCACAGACATTGATATGTCTAAGTACTTATCCCGGGATTGCCACCTGACAGTTT\n", "\n" ] } ], "prompt_number": 43 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conditional expressions can be used to filter list comprehensions." ] }, { "cell_type": "code", "collapsed": false, "input": [ "test_seq = 'GCATATATGCTAG'\n", "gcs = [char for char in test_seq if char == 'G' or char == 'C']\n", "print gcs" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['G', 'C', 'G', 'C', 'G']\n" ] } ], "prompt_number": 44 }, { "cell_type": "code", "collapsed": false, "input": [ "gc_count = sum(1 for char in test_seq if char == 'G' or char == 'C')\n", "print gc_count" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "5\n" ] } ], "prompt_number": 45 }, { "cell_type": "code", "collapsed": false, "input": [ "def gc_content(seq):\n", " ''' Returns the GC content of seq as a percentage. '''\n", " gc_count = sum(1 for char in seq if char == 'G' or char == 'C')\n", " return 100 * float(gc_count) / len(seq)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 46 }, { "cell_type": "markdown", "metadata": {}, "source": [ "lambda expressions can be used to create anonymous functions.\n", "We will use a lambda expression to tell `max` to sort a list of tuples by the second element in each tuple." ] }, { "cell_type": "code", "collapsed": false, "input": [ "records_gc = [(name, gc_content(seq)) for name, seq in records]\n", "name, gc = max(records_gc, key=lambda (name, gc): gc)\n", "print name\n", "print gc" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Rosalind_6955\n", "52.6633165829\n" ] } ], "prompt_number": 47 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Rosalind: Finding a Protein Motif##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I have placed a file `gc_content.py` in the same folder that this notebook is in. Now I can import `gc_content` and have access to all its guts in a namespace called 'gc_content'." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import gc_content" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 48 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This `import` statement evaluates all of the code in `gc_content`. Sometimes this is not what you want to happen. The file being imported will have some stuff it is supposed to do on its own that you don't care about when you are importing it. The `if __name__ = '__main__':` idiom takes care of this problem." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat gc_content.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "def fasta_records(fh):\r\n", " ''' Yields (name, seq) pairs for fasta records in the file-like object fh. '''\r\n", " name = fh.readline().strip().lstrip('>')\r\n", " while name:\r\n", " seq = ''\r\n", " line = fh.readline().strip()\r\n", " while line and not line.startswith('>'):\r\n", " seq += line\r\n", " line = fh.readline().strip()\r\n", " yield name, seq\r\n", " name = line.lstrip('>')\r\n", "\r\n", "def gc_content(seq):\r\n", " ''' Returns the GC content of seq as a percentage. '''\r\n", " gc_count = sum(1 for char in seq if char == 'G' or char == 'C')\r\n", " return 100 * float(gc_count) / len(seq)\r\n", "\r\n", "if __name__ == '__main__':\r\n", " file_name = '../data/rosalind_gc.txt'\r\n", " fh = open(file_name)\r\n", " records = [(name, gc_content(seq)) for name, seq in fasta_records(fh)]\r\n", " name, gc = max(records, key=lambda (name, gc): gc)\r\n", " print name\r\n", " print gc\r\n" ] } ], "prompt_number": 49 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the `fasta_records` has been changed to accept a file-like object instead of a file name, for reasons that will be clear below.\n", "\n", "If we write docstrings for our functions, we can get reminders about what they do. \n", "This is what pops up in your browser if you were to evaluate `gc_content.fasta_records?`. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "gc_content.fasta_records.__doc__" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 50, "text": [ "' Yields (name, seq) pairs for fasta records in the file-like object fh. '" ] } ], "prompt_number": 50 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the `yield` keyword in the `fasta_records` code. That makes it a special kind of function called a generator. Generators are a convenient way to define your own iterators." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def simple_generator(n):\n", " for i in range(n):\n", " yield i\n", " \n", "my_gen = simple_generator(4)\n", "print [number**2 for number in my_gen]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[0, 1, 4, 9]\n" ] } ], "prompt_number": 51 }, { "cell_type": "markdown", "metadata": {}, "source": [ "`urllib` provides a way to read files from the web as seamlessly as if they were local. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib\n", "\n", "opened_url = urllib.urlopen('http://www.uniprot.org/uniprot/A2Z669.fasta')\n", "name, seq = gc_content.fasta_records(opened_url).next()\n", "print name\n", "print seq" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "sp|A2Z669|CSPLT_ORYSI CASP-like protein OsI_33147 OS=Oryza sativa subsp. indica GN=OsI_33147 PE=2 SV=1\n", "MRASRPVVHPVEAPPPAALAVAAAAVAVEAGVGAGGGAAAHGGENAQPRGVRMKDPPGAPGTPGGLGLRLVQAFFAAAALAVMASTDDFPSVSAFCYLVAAAILQCLWSLSLAVVDIYALLVKRSLRNPQAVCIFTIGDGITGTLTLGAACASAGITVLIGNDLNICANNHCASFETATAMAFISWFALAPSCVLNFWSMASR\n" ] } ], "prompt_number": 52 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regular expressions make it easy to look for the N-glycosylation motif." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib\n", "\n", "motif = re.compile(r'(?=(N[^P][ST][^P]))')\n", "\n", "file_name = 'data/rosalind_mprt.txt'\n", "fh = open(file_name)\n", "\n", "for line in fh:\n", " uniprot_id = line.strip()\n", " url = 'http://www.uniprot.org/uniprot/{0}.fasta'.format(uniprot_id)\n", " opened_url = urllib.urlopen(url)\n", "\n", " for name, seq in gc_content.fasta_records(opened_url):\n", " ps = [str(match.start() + 1) for match in motif.finditer(seq)]\n", " if ps:\n", " print uniprot_id\n", " print ' '.join(ps)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Q3T0C9\n", "15 38\n", "Q32LI2" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "157\n", "P00740_FA9_HUMAN" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "203 213\n", "P04233_HG2A_HUMAN" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "130 136 256 270\n", "P22891_PRTZ_HUMAN" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "99 225 233 306 332\n", "P80195_MPP3_BOVIN" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "95\n", "P02974_FMM1_NEIGO" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "67 68 121\n", "B3CNE0" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "107\n", "P98119_URT1_DESRO" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "153 398\n", "P21810_PGS1_HUMAN" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "270 311\n", "Q9D9T0" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "154\n", "Q4FZD7" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "528\n", "P19823_ITH2_HUMAN" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "118 445\n" ] } ], "prompt_number": 53 } ], "metadata": {} } ] }