{
 "metadata": {
  "name": "",
  "signature": "sha256:04e90c056a04c517a526ed5cca1d3f05d8b46c3c3a4b8e853db578d21c41a1db"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[![Py4Life](https://raw.githubusercontent.com/Py4Life/TAU2015/gh-pages/img/Py4Life-logo-small.png)](http://py4life.github.io/TAU2015/)\n",
      "## Lecture 6 - 29.4.2015\n",
      "### Last update: 26.4.2015\n",
      "### Tel-Aviv University / 0411-3122 / Spring 2015"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Previously on Py4Life\n",
      "\n",
      "- Modules\n",
      "- Files\n",
      "- Regex\n",
      "- Debugging"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## In today's episode\n",
      "\n",
      "- Biopython\n",
      "- Sequence analysis\n",
      "- Sequence data in python"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Biopython"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### What is Biopython?\n",
      ">Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.\n",
      "\n",
      "(from the Biopython [official website](http://biopython.org/wiki/Main_Page)).  \n",
      "\n",
      "Simply put, Biopython is a python package (collection of modules), which includes lots of functions and tools associated with __bioinformatics__."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### What's in the Biopython package?\n",
      "* Ready-to-use parsers, dedicated to biological data file formats:\n",
      "    * FASTA\n",
      "    * GenBank\n",
      "    * BLAST output\n",
      "    * Newick\\Nexus phylogenetic trees\n",
      "* Special data types to handle biological data:\n",
      "    * Nucleotide sequences\n",
      "    * Protein sequences\n",
      "    * Protein structure\n",
      "    * Phylogenetic trees\n",
      "* Easy accessibility to biological databases:\n",
      "    * NCBI BLAST\n",
      "    * PubMed\n",
      "    * GenBank\n",
      "    * SwissProt\n",
      "* Bioinformatic tools:\n",
      "    * MSA\n",
      "    * Clustering\n",
      "    * Sequence manipulation  \n",
      "\n",
      "And lots of other stuff...  \n",
      "We will not cover everything today."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Installing Biopython\n",
      "Biopython is not part of the core distribution and thus needs to be installed before it can be used. To install the package, use `pip install Biopython` or another installation method, as explained in lecture 4. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Using Biopython\n",
      "Once installed, we can import the package using:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import Bio"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In practice, we won't usually do that, because this package is _huge_, and we only need certain parts of its functionality. Therefore, we can do:  \n",
      "`from Bio import <module name>`. For example:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from Bio import SeqIO"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Documentation\n",
      "Biopython has very good and very vast documentation.  \n",
      "A great tutorial is available [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html).  \n",
      "If you need a specific task, you might want to take a look at the ['Cookbook' pages](http://biopython.org/wiki/Category:Cookbook) or the [wiki pages](http://biopython.org/wiki/Category:Wiki_Documentation).  \n",
      "Biopython's online community is very active. Join one of the [mailing lists](http://biopython.org/wiki/Mailing_lists) for lots of useful discussions and lots of spam..."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Introducing Sequence objects\n",
      "![biopython](lec6_files/biopython.jpg)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So far in this course, we have seen most of the basic data types in Python: str,int,float,list,dict etc. We've also briefly talked about some specialized data types such as file objects (generated when using the `open()` function) and the csv reader.  \n",
      "One of the most basic objects in bioinformatics are sequences, and Biopython has a whole mechanism for dealing with sequences. This is done by introducing a new data type, namely a _sequence_ data type.  \n",
      "In principal, we could just use strings to represent biological sequences."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "seq = \"AGTACACTGGT\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "But using Biopython's specialized data type, we get much more functionality, novel to sequences, without giving away the string functionality. That's achieved thanks to special _methods_ of the sequence data type (but not the string data type)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Creating sequence objects\n",
      "To work with sequence objects, we'll first have to import the relevant class (specific part of the module):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from Bio.Seq import Seq"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, creating a sequence object is done using `Seq`, on the required sequence string. Like this:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "my_DNA_seq = Seq(\"AGTACACTGGT\")\n",
      "print(type(seq))\n",
      "print(type(my_DNA_seq))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can store amino acid sequences in the same way:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "my_AA_seq = Seq(\"EVRNAK\")\n",
      "print(type(my_AA_seq))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### What can we do with sequence objects?\n",
      "As we said earlier, sequence objects have all the functionality that strings have. That means that everything that we've done with strings can also be done with sequence objects. Some examples:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# printing\n",
      "print(my_DNA_seq)\n",
      "# length\n",
      "print(len(my_DNA_seq))\n",
      "# Accessing specific elements\n",
      "print(my_DNA_seq[0])\n",
      "print(my_DNA_seq[7])\n",
      "print(my_DNA_seq[-1])\n",
      "# Slicing\n",
      "print(my_DNA_seq[3:10])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Notice that slicing a sequence object results in a new sequence object:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "partial_DNA = my_DNA_seq[3:10]\n",
      "print(type(partial_DNA))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sequence concatenation\n",
      "seq1 = Seq(\"AGGCACCATTA\")\n",
      "seq2 = Seq(\"TTACAGGA\")\n",
      "whole_seq = seq1 + seq2\n",
      "print(whole_seq)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "OK, so here's when it gets interesting: sequence objects have some special and powerful methods. For example, finding the complement and the reverse-complement has never been easier:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "my_DNA_seq = Seq(\"GATCGATGGGCCTATATAGGATCGAAAATCGC\")\n",
      "complement_seq = my_DNA_seq.complement()\n",
      "print(complement_seq)\n",
      "rev_comp_seq = my_DNA_seq.reverse_complement()\n",
      "print(rev_comp_seq)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can transcribe DNA to RNA, which is basically just changing T -> U (assuming that we are using the 'sense' strand)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "coding_DNA = Seq(\"ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG\")\n",
      "mRNA = coding_DNA.transcribe()\n",
      "print(mRNA)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And also translate to protein sequence, either from RNA or straight from coding (sense) DNA:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "peptide = mRNA.translate()\n",
      "print(peptide)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## <span style=\"color:blue\">Class exercise 6A</span>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Write a function that receives __antisense__ DNA, as a __string__ and returns it's translation, as a __sequence object__. Test your code on the given string sequence (displayed 5' to 3')."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from Bio.Seq import Seq\n",
      "def antisense_string_to_protein_seq(DNA_string):\n",
      "    antisense_seq = Seq(DNA_string)\n",
      "    sense_seq = antisense_seq.reverse_complement()\n",
      "    prot_seq = sense_seq.translate()\n",
      "    return prot_seq\n",
      "\n",
      "antisense_DNA = \"TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC\"\n",
      "protein = antisense_string_to_protein_seq(antisense_DNA)\n",
      "print(protein)\n",
      "assert str(protein) == 'DSPWESRRVMLPV'\n",
      "assert isinstance(protein,Seq)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Advancing to sequence record objects\n",
      "Representing sequences is great, but sometimes we want to associate more information with the sequence object, such as the gene name, references, organism and general descriptions. To do that, we move from simple sequence objects to more complex ones, called _Sequence record_ objects.  "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### GenBank\n",
      "GenBank is a database of all publicly available nucleotide data collected by scientists, maintained by NCBI. Sequences in GenBank are annotated, meaning that the sequence itself is accompanied by additional data, regarding the meaning, source and other details of the sequence. Each sequence, along with its accompanying data, is called a __record__. As part of the NCBI collection of databases, GenBank can be searched using a dedicated search engine, called _Entrez_.  \n",
      "To better understand the idea of sequence records, lets take a look at the GenBank record for the MatK gene of Oryza sativa (rice)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.display import HTML\n",
      "HTML('<iframe src=http://www.ncbi.nlm.nih.gov/nuccore/JN114766.1 width=1000 height=350></iframe>')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Notice that there are lots of additional data besides the sequence itself. So a sequence record object is basically just a sequence object, associaated to some other simple data fields."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Creating sequence record objects\n",
      "In principal, we can create sequence record objects 'from scratch', like we did with simple sequence objects. However, in practice this is rarely used. Instead, we usually get sequence record objects from data parsers. For example, here we get the GenBank record shown above as a sequence record object."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from Bio import Entrez\n",
      "from Bio import SeqIO\n",
      "try:\n",
      "    handle = Entrez.efetch(db=\"nucleotide\", rettype=\"gb\", retmode=\"text\", id=\"387865390\")\n",
      "    O_sativa_matK = SeqIO.read(handle, \"gb\")\n",
      "    handle.close()\n",
      "    print(type(O_sativa_matK))\n",
      "except:\n",
      "    print('Failed to fetch record.')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We'll soon learn to use the SeqIO module, but for now let's just look at what we can do with the object we got.  \n",
      "A sequence record object contains some data fields which we can extract by calling their names.  \n",
      "The most basic is the sequence itself, which is called using `.seq`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "sequence = O_sativa_matK.seq\n",
      "print(sequence)\n",
      "print(type(sequence))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note that we got a sequence object, not just a string.  \n",
      "Some other basic data fields of the sequence record object are the _id, name_ and _description_:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(O_sativa_matK.id)\n",
      "print(O_sativa_matK.name)\n",
      "print(O_sativa_matK.description)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "There are a couple of other data fields, but we will only look at one. The _annotations_ field contains more general information about the sequence. In fact, it's a dictionary where the keys are the names of the information and the values are the actual information. For example:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# get the organism\n",
      "print(O_sativa_matK.annotations['organism'])\n",
      "# get the DNA source\n",
      "print(O_sativa_matK.annotations['source'])\n",
      "# get date of submission\n",
      "print(O_sativa_matK.annotations['date'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Another data field worth mentioning is the _features_ field, describing the roles and locations of higher order sequence domains and elements within a record. This is especially relevant for records describing long sequences such as whole chromosomes. Record features are further discussed in HW6."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Sequence I/O"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The SeqIO module is the easiest way to deal with sequences. It's a wrapper for the Seq and SeqRecord (and other) modules and lets us manage, manipulate, read and write sequence data without doing any parsing.  \n",
      "As usual, we'll start by importing the module:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from Bio import SeqIO"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Reading sequence files"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The file _Orchids.fasta_ contains multiple DNA sequences of Orchids, stored in the FASTA format. In lesson 4 we wrote our own function for parsing fasta files, but the SeqIO module has its own parsers, which we'll learn how to use.\n",
      "The most useful function within the SeqIO module is `SeqIO.parse()`. This function is used to read files including sequence data in one of the standard formats."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fasta_handle = SeqIO.parse('lec6_files/Orchids.fasta','fasta')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The function expects two parameters: a file path and a string specifying the file format (a full list of supported formats available [here](http://biopython.org/wiki/SeqIO)). It returns a special object which lets us read the file. Once we have the handle, we can use it to iterate over the sequence records in the file, using a simple `for` loop:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for seq_record in fasta_handle:\n",
      "    print(seq_record.id,len(seq_record))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As you can see, each iteration of the loop yields a SeqRecord object, which we can treat as we did in the previous section. For example, here we extracted the _id_ and the length of each sequence."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Similarly, we can parse a GenBank format file with the same function. All we have to do is replace the _'fasta'_ string with _'genbank'_. We'll also do it in a more elegant, one-step code:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for seq_record in SeqIO.parse('lec6_files/Orchids.gbk','genbank'):\n",
      "    print(seq_record.id,len(seq_record))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Another useful trick is converting the sequence file handle into a list, so its elements (SeqRecord objects) are more easily accessed:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "orchid_records_list = list(SeqIO.parse('lec6_files/Orchids.gbk','genbank'))\n",
      "# last record id\n",
      "print(orchid_records_list[-1].id)\n",
      "# number of records in file\n",
      "print(len(orchid_records_list))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is not a good idea though, when dealing with large files, since very long lists are not desired."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If your file contains only one record, you can use `SeqIO.read()` which will return a SeqRecord object."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "record = SeqIO.read('lec6_files/P_emersonii.gbk', 'genbank')\n",
      "print(record.seq)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## <span style=\"color:blue\">Class exercise 6B</span>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Write a function that reads a GenBank file and returns the unique species names found in the file. Use the `.annotations` field to get the organism name of each record.. The species is always the second word of the organism name. The `set()` command will ensure you get unique species (each species will appear only once). Complete the code below:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from Bio import SeqIO\n",
      "def get_unique_species(gb_file):\n",
      "    species_list = []\n",
      "    # iterate on file records\n",
      "    for seq_record in SeqIO.parse(gb_file,'genbank'):\n",
      "        # get species\n",
      "        record_organism = seq_record.annotations['organism']\n",
      "        species = record_organism.split()[1]   # get the second word\n",
      "        # insert species to list\n",
      "        species_list.append(species)\n",
      "    return set(species_list)\n",
      "\n",
      "orchids_species = get_unique_species('lec6_files/Orchids.gbk')\n",
      "print(orchids_species)\n",
      "assert len(orchids_species) == 92"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Reading sequence files as dictionaries\n",
      "Another nice option for dealing with sequence files is reading them as _dictionaries_. This is done using `SeqIO.to_dict()` function. `to_dict()` takes a sequence file handle, created by `SeqIO.parse()`, as an argument."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "handle = SeqIO.parse('lec6_files/Orchids.gbk','genbank')\n",
      "orchids_dict = SeqIO.to_dict(handle)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The function returns a dictionary, which we can work with as usual."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(orchids_dict.keys())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So the keys are the IDs, and the values are SeqRecord objects. It's very easy to access specif records this way:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(orchids_dict['Z78481.1'].seq)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Again, this could get tricky when working with large files due to memory problems."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Writing sequence files\n",
      "SeqIO also makes it easy to write sequence files. For that, we have the `SeqIO.write()` function. It receives a list of SeqRecords and writes them to an output file, in a format of your choice.  \n",
      "For example, the following function reads a GenBank file, and prints to a new file only records that belong to the organism specified by the user."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def filter_GB_file_by_organism(input_file,organism,output_file):\n",
      "    organism_records = []\n",
      "    # read the input file\n",
      "    for record in SeqIO.parse(input_file,'genbank'):\n",
      "        rec_organism = record.annotations['organism']\n",
      "        if rec_organism == organism:\n",
      "            organism_records.append(record)\n",
      "    # write the desired records to output\n",
      "    SeqIO.write(organism_records,output_file,'genbank')   # notice the 3 arguments\n",
      "    \n",
      "orchids_file = 'lec6_files/Orchids.gbk'\n",
      "C_irapeanum_file = 'lec6_files/C_irapeanum.gbk'\n",
      "P_emersonii_file = 'lec6_files/P_emersonii.gbk'\n",
      "filter_GB_file_by_organism(orchids_file,'Cypripedium irapeanum',C_irapeanum_file)\n",
      "filter_GB_file_by_organism(orchids_file,'Paphiopedilum emersonii',P_emersonii_file)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can easily change the function to create files in a different format. This format may be set by the user of the function:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def filter_GB_file_by_organism(input_file,organism,output_file, output_format = 'fasta'):\n",
      "    organism_records = []\n",
      "    # read the input file\n",
      "    for record in SeqIO.parse(input_file,'genbank'):\n",
      "        rec_organism = record.annotations['organism']\n",
      "        if rec_organism == organism:\n",
      "            organism_records.append(record)\n",
      "    # write the desired records to output\n",
      "    SeqIO.write(organism_records,output_file, output_format)   # <--- changed to fasta here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Which means that `SeqIO.write()` can be used to convert between file formats. However, this is not necessary, since SeqIO has its built-in `convert()` function:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "SeqIO.convert('lec6_files/P_emersonii.gbk','genbank','lec6_files/P_emersonii.fasta','fasta')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## A BLAST from the past\n",
      "![BLAST](http://www.mrc-lmb.cam.ac.uk/rlw/text/bioinfo_tuto/images/psiblast_1-289_itr1.jpg)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "__BLAST__ - Basic Local Alignment and Search Tool, is one of the most important methods in bioinformatics.  \n",
      "_Biopython_ lets you run BLAST from within your code, retrieve and analyze the results and save them. But before we see how to do that, lets recall how BLAST actually works."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### BLAST - reminder\n",
      "BLAST is an algorithm used for comparing a __query sequence__ agains a __database__, or library of sequences. The database may be all the known sequences, stored in NCBI, a subset of these, or any other collection of sequences we want to search. The sequences in question may be nucleotide sequences, amino acids or even codons.  \n",
      "The result of a BLAST run is zero or more database sequences, which are similar enough to the query sequence. These results are called __hits__.  \n",
      "![BLAST](lec6_files/blast.png)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Lets run a nucleotide BLAST through the NCBI web, on the sequence:  \n",
      "GTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCAGTTTACTTTGGTCACCCATGGGCATTTGCT  \n",
      "ATTGCAGTGACCGAGATTTGCCATCGAGCCTCCTTGGGAGCTTTCTTGCTGGCGATCTAAACCCTAGCCCGGC  \n",
      "GCAGTTTTGCGCCAAGTCATATGACACATAATTGGCGAAGGGGGCGGCATGGTGCCTTGACCCTCCCCAAATC  \n",
      "ATTTTTTTAACAACTCTCAGCAACGGAAGGGCACGCCTGCCTGGGCATTGCGAGTCATATCTCTCCCTTAATG  \n",
      "AGGCTGTCCACACATACT"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "HTML('<iframe src=http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome width=1000 height=350></iframe>')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's take a look at the results.  \n",
      "First, we have a graphical representation of the best matches found in the database. Colors indicate the level of similarity.  \n",
      "Note that hits may contain more than one region of similarity. Each such region is called a high-scoring segment pair (__HSP__).  \n",
      "Then, a list of the hits follows. Each BLAST hit has its name, and several values describing the match:  \n",
      "- Max score - the score of the best HSP\n",
      "- Total score - sum of scores for all HSPs (will be equal to Max score if the hit is only one HSP)\n",
      "- Query cover - the percent of the query sequence covered by the hit\n",
      "- E-value - the number of hits with better score expected to be found in a randomized database. This is used as an indication of the statistical significance of the match. The lower the E-value is, the more significant the hit. E-value < 10<sup>-4</sup> are usually considered significant.\n",
      "- Ident - percent of sequence identity between the hit and the query"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Running BLAST from within python\n",
      "We can perform the same BLAST run using a dedicated module included in Biopython. This is done using the `qblast()` method from the `NCBIWWW` module. It takes three arguments:  \n",
      "- BLAST algorithm to use (_blastn,blastp,blastx, etc._), a string\n",
      "- Database to querry (all nucleotides, refseq, all protein, etc.), a string\n",
      "- Sequence - may be a string representing a sequence, a sequence object, a sequence-record object, a string representing sequence ID, or even a FASTA file."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from Bio.Blast import NCBIWWW\n",
      "seq = Seq(\"GTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCAGTTTACTTTGGTCACCCATGGGCATTTGCTATTGCAGTGACCGAGATTTGCCATCGAGCCTCCTTGGGAGCTTTCTTGCTGGCGATCTAAACCCTAGCCCGGCGCAGTTTTGCGCCAAGTCATATGACACATAATTGGCGAAGGGGGCGGCATGGTGCCTTGACCCTCCCCAAATCATTTTTTTAACAACTCTCAGCAACGGAAGGGCACGCCTGCCTGGGCATTGCGAGTCATATCTCTCCCTTAATGAGGCTGTCCACACATACT\")\n",
      "try:\n",
      "    result_handle = NCBIWWW.qblast(\"blastn\", \"nt\", seq)\n",
      "    print(result_handle)\n",
      "except:\n",
      "    print('BLAST run failed!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Parsing BLAST results\n",
      "Once we have fetched the BLAST result, we would like to extract the information stored within it.  \n",
      "The first thing we do is to convert it into a usable format, using `NCBIXML.read()` on the result we retrieved."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from Bio.Blast import NCBIXML\n",
      "blast_record = NCBIXML.read(result_handle)\n",
      "print(blast_record)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "still not very helpful. We will have to iterate over the BLAST hits, using `.alignments`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for hit in blast_record.alignments:\n",
      "    print(hit)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "within each hit, we can iterate over the HSPs, using `.hsps`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for hit in blast_record.alignments:\n",
      "    print('***********************')\n",
      "    print(hit)\n",
      "    for hsp in hit.hsps:\n",
      "        print(hsp)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "HSPs contain lots of useful information, for example, its name, length and E-value.  \n",
      "Let's write a function that will print only matches with E-value smaller than a chosen threshold."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def significant_matches(query,e_value_thresh):\n",
      "    try:\n",
      "        result_handle = NCBIWWW.qblast(\"blastn\", \"nt\", query)\n",
      "    except:\n",
      "        print('BLAST run failed!')\n",
      "        return None\n",
      "    blast_record = NCBIXML.read(result_handle)\n",
      "    for hit in blast_record.alignments:\n",
      "        for hsp in hit.hsps:\n",
      "            if hsp.expect < e_value_thresh:\n",
      "                print(hit.title)\n",
      "                print(hsp.sbjct)\n",
      "                \n",
      "significant_matches(seq,e**(-100))                "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## <span style=\"color:blue\">Class exercise 6C</span>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Write a function that receives a sequences, runs a standard BLAST search and returns the name of the longest hit (use the `.length` and `.title` commands) "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from Bio.Seq import Seq\n",
      "from Bio.Blast import NCBIWWW, NCBIXML\n",
      "def longest_blast_hit(seq):\n",
      "    result_handle = NCBIWWW.qblast(\"blastn\", \"nt\", seq)\n",
      "    blast_record = NCBIXML.read(result_handle)\n",
      "    longest = 0\n",
      "    for hit in blast_record.alignments:\n",
      "        if hit.length > longest:\n",
      "            longest = hit.length\n",
      "            name = hit.title\n",
      "    return name\n",
      "\n",
      "assert longest_blast_hit(seq).split('|')[1] == '47776119'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Fin\n",
      "This notebook is part of the _Python Programming for Life Sciences Graduate Students_ course given in Tel-Aviv University, Spring 2015.\n",
      "\n",
      "The notebook was written using [Python](http://pytho.org/) 3.4.1 and [IPython](http://ipython.org/) 2.1.0 (download from [PyZo](http://www.pyzo.org/downloads.html)).\n",
      "\n",
      "The code is available at https://github.com//Py4Life/TAU2015/blob/master/lecture6.ipynb.\n",
      "\n",
      "The notebook can be viewed online at http://nbviewer.ipython.org/github//Py4Life/TAU2015/blob/master/lecture6.ipynb.\n",
      "\n",
      "This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.\n",
      "\n",
      "![Python logo](https://www.python.org/static/community_logos/python-logo.png)"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}