{
 "metadata": {
  "name": "",
  "signature": "sha256:c337d58516ad1dd551e2d24e499e1b04af3ca0b2279c20f80d6ce6cb3dd332d8"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[![Py4Life](https://raw.githubusercontent.com/Py4Life/TAU2015/gh-pages/img/Py4Life-logo-small.png)](http://py4life.github.io/TAU2015/)\n",
      "## Lecture 4 - 30.3.2015\n",
      "### Last update: 26.3.2015\n",
      "### Tel-Aviv University / 0411-3122 / Spring 2015\n",
      "\n",
      "This notebook is still a draft."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Previously on Py4Life\n",
      "\n",
      "- Lists\n",
      "- Dictionaries\n",
      "- Functions"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## In today's episode\n",
      "\n",
      "- Modules\n",
      "- Files I/O\n",
      "- The CSV format\n",
      "- File parsing\n",
      "- Regular expressions"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Modules\n",
      "Last time, we learned how to write and use functions.  \n",
      "Luckily, we don't have to create all the code alone. Many people have written functions that perform various tasks.  \n",
      "These functions are grouped into packages, called _modules_, which are usually designed to address a specific set of tasks.  \n",
      "These functions are not always 'ready to use' by our code, but rather have to be _imported_ into it."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's start with an example. Suppose we want to do some trigonometry. We could (in principle) write our own _sin_, _cos_, _tan_, etc., but it is much easier to use the built-in _math_ module."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# first, we import the module\n",
      "import math"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we can use values and functions within the math module"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "twice_pi = 2*math.pi\n",
      "print(twice_pi)\n",
      "radius = 2.3\n",
      "perimeter = twice_pi * radius\n",
      "print(perimeter)\n",
      "print(math.sin(math.pi/6))\n",
      "print(math.cos(math.pi/3))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you only need one or two functions from a module, you can import them, instead of the whole module. This way, we don't have to call the module name every time."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# import required functions from math\n",
      "from math import pi,sin,cos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# now we can use the imported names directly, without the math. prefix\n",
      "twice_pi = 2*pi\n",
      "print(twice_pi)\n",
      "print(sin(pi/6))\n",
      "print(cos(pi/3))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "OK, cool, but how do I know which modules I need?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "![Google](http://www.catonmat.net/blog/wp-content/uploads/2009/03/google-python-search-library.jpg)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can also view the module documentation to see what functions are available and how to use them. Each module in the Python standard library has a documentation page, for example: https://docs.python.org/3/library/math.html  \n",
      "Two more useful links:  \n",
      "https://pypi.python.org/pypi - the Python Package Index (PyPI), a repository of third-party Python packages  \n",
      "https://wiki.python.org/moin/NumericAndScientific - a list of scientific modules"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Installing modules\n",
      "Many Python modules are included within the core distribution, and all you have to do is `import` them. However, many other modules need to be downloaded and installed first.  \n",
      "Python has built-in tools for installing modules, but sometimes things go wrong. Therefore, try the following methods, in this order.  "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 1. Use PIP\n",
      "PIP is a program that comes with Python and (usually) makes it easy to install packages and modules.  \n",
      "We'll run PIP from a shell rather than from within the notebook. Here we use the Pyzo IEP shell, usually located at _C:\\pyzo\\IEP.exe_. This interactive shell can run useful commands. It looks something like this:  \n",
      "![IEP](lec4_files/IEP.jpg)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We'll enter our commands in this shell window. For example, to get a list of all the modules already installed (not including built-in modules) and their versions, we can type `pip freeze`:  \n",
      "![pip_freeze](lec4_files/pip_freeze.jpg)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To install a new package, all we have to do is: `pip install packagename`, and that's it. Just make sure there are no error messages raised during the installation.  \n",
      "If, for some reason, things don't work out that well, proceed to option 2."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 2. Use Conda\n",
      "Conda is another useful tool, rather similar to PIP. To use it, just type `conda` in the IEP shell. To get the list of installed modules, type `conda list`. To install a module, type `conda install packagename`.  \n",
      "If this doesn't work either, proceed to option 3."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### 3. Use Windows binaries installation\n",
      "If nothing else works, you can try looking for your package [on this website](http://www.lfd.uci.edu/~gohlke/pythonlibs/). It contains many downloadable installers which you can just click through to easily install a package. Make sure to choose the download that fits your Python version and operating system.  \n",
      "Not all modules are available through this website. If you don't find your module there, you might have to install it from source. Details [here](https://docs.python.org/3/install/)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Files I/O\n",
      "So far, we have only used rather small amounts of data, like numbers, short strings and short lists. We stored the data in variables (i.e. in memory) and manipulated them there. But what happens if we need to store large amounts of data?\n",
      "- Whole genomes\n",
      "- List of all insect species\n",
      "- Multiple numeric values  \n",
      "  \n",
      "This is what files are for!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Why do we need files?\n",
      "- Store large amounts of data\n",
      "- Use data in multiple sessions\n",
      "- Use data outside python\n",
      "- Provide data for other tools/programs"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We'll start with simple text files and proceed to more complex formats.  \n",
      "Let's read the list of crop plants located in `lec4_files/crops.txt`."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Reading files\n",
      "Whenever we want to work with a file, we first need to _open_ it. This is, not surprisingly, done using the `open` function.  \n",
      "This function returns a file object which we can then use."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "crops_file = open('lec4_files/crops.txt','r')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The `open` function receives two parameters: the path to the file you want to open and the mode of opening (both strings). In this case - 'r' for 'read'.  \n",
      "Notice the / instead of \\ in the path. This is the easiest way to avoid path errors. Also note that this command alone doesn't read anything yet; it just opens the file and creates the file object (sometimes called a file handle).  \n",
      "In fact, we'll usually call `open` inside a `with` statement, which closes the file automatically once the indented block ends:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open('lec4_files/crops.txt','r') as crops_file:\n",
      "    # indented block\n",
      "    # do stuff with file\n",
      "    pass"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "OK, so what can we do with files?  \n",
      "The most common task would be to read the file line by line.  "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Looping over the file object\n",
      "We can simply use a _for_ loop to go over all lines. This is the best practice, and also very simple to use:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open('lec4_files/crops.txt','r') as crops_file:\n",
      "    for line in crops_file:\n",
      "        if line.startswith('Musa'):   # check if line starts with a given string\n",
      "            print(line)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Oops, why did we get double newlines?  \n",
      "Each line in the file ends with a _newline_ character. Although it is invisible in most editors, it is certainly there! In python, a newline is represented as `\\n`.  \n",
      "`print()` adds a newline of its own after the newline character that already ends every line in the file, so we end up with double newlines.  \n",
      "We can use `strip()` to remove whitespace, including the newline character, from both ends of each line."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open('lec4_files/crops.txt','r') as crops_file:\n",
      "    for line in crops_file:\n",
      "        line = line.strip()\n",
      "        if line.startswith('Musa'):\n",
      "            print(line)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "![Musa species](http://www.replicatedtypo.com/wp-content/uploads/2010/08/Picture-49.png)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Reading the entire file - read()\n",
      "Another option is to read the entire file as a big string with the `read()` method.  \n",
      "Careful with this one! This is not recommended for large files."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open('lec4_files/crops.txt','r') as crops_file:\n",
      "    entire_file = crops_file.read()\n",
      "    print(entire_file[:102]) # print first 102 characters"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Reading line by line with readline()\n",
      "The `readline()` method allows us to read a single line each time. It works very well when combined with a _while_ loop, giving us good control of the program flow."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open('lec4_files/crops.txt','r') as crops_file:\n",
      "    line = crops_file.readline()    # read first line\n",
      "    while line:\n",
      "        line = line.strip()\n",
      "        if line.startswith('Triticum'):\n",
      "            print(line)\n",
      "        line = crops_file.readline()    # read next line"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "__REMEMBER__ to always read the next line within the while loop. Otherwise, you'll get stuck in an infinite loop, processing the first line over and over again..."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "There are other methods you can use to read files. For example, take a look at the `readlines()` method here:  \n",
      "https://docs.python.org/3/tutorial/inputoutput.html"
     ]
    },
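    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a minimal sketch of `readlines()`, the cell below writes a small temporary file and reads it back as a list of lines. The file name and its three lines are made up for illustration; this is not the course's crops.txt. Note that every item in the list still ends with a newline character, and that the whole file is loaded into memory at once."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import os\n",
      "import tempfile\n",
      "\n",
      "# hypothetical demo file, created only for this example\n",
      "demo_path = os.path.join(tempfile.mkdtemp(), 'demo_crops.txt')\n",
      "with open(demo_path, 'w') as demo_file:\n",
      "    print('Musa acuminata', file=demo_file)\n",
      "    print('Triticum aestivum', file=demo_file)\n",
      "    print('Garcinia mangostana', file=demo_file)\n",
      "\n",
      "with open(demo_path, 'r') as demo_file:\n",
      "    demo_lines = demo_file.readlines()   # list of strings, one per line\n",
      "\n",
      "print(len(demo_lines))          # number of lines in the file\n",
      "print(demo_lines[-1].strip())   # last line, without the trailing newline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },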
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Summary\n",
      "Whenever we work with a file, there are three elements:\n",
      "- File __path__ - the actual location of the file on the hard drive (use `/` rather than `\\`).\n",
      "- File __object__ - the way files are handled in Python.\n",
      "- File __contents__ - what is extracted from the file, depending on the method used on the file object."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## <span style=\"color:blue\">Class exercise 4A</span>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Use one of the file-reading techniques shown above to:  \n",
      "1) Print the last line in the file.  \n",
      "2) Find out how many _Garcinia_ species are in the file (use the `startswith()` method)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open('lec4_files/crops.txt','r') as crops_file:\n",
      "    entire_file = crops_file.read()\n",
      "    # strip first, so a newline at the end of the file doesn't leave an empty last item\n",
      "    lines_list = entire_file.strip().split('\\n')\n",
      "    print(lines_list[-1])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open('lec4_files/crops.txt','r') as crops_file:\n",
      "    garcinia_count = 0\n",
      "    for line in crops_file:\n",
      "        if line.startswith('Garcinia'):\n",
      "            garcinia_count += 1\n",
      "    print(garcinia_count)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Writing to a file\n",
      "To write to a file, we first have to open it for writing. This is done using one of two modes: 'w' or 'a'.  \n",
      "'w', for write, will let you write into the file. If it doesn't exist, it'll be automatically created. If it exists and already has some content, __the content will be overwritten!__  \n",
      "'a', for append, is very similar, only it will not overwrite, but add your text to the end of an existing file. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open('lec4_files/output.txt','w') as out_file:\n",
      "    # indented block\n",
      "    # write into file...\n",
      "    pass"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Writing is done using good old `print()`, only we add the argument `file=<file object>`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open('lec4_files/output.txt','w') as out_file:\n",
      "    print('This is the first line', file=out_file)\n",
      "    line = 'Another line'\n",
      "    print(line, file=out_file)\n",
      "    seq1 = 'ATTAGCGGATA'\n",
      "    seq2 = 'GGCATATAT'\n",
      "    print(seq1 + seq2, file=out_file)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
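    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The difference between the 'w' and 'a' modes can be sketched as follows. The file name below is hypothetical and the file is created in a temporary folder, so nothing in lec4_files is touched."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import os\n",
      "import tempfile\n",
      "\n",
      "# hypothetical demo file in a temporary folder\n",
      "mode_demo_path = os.path.join(tempfile.mkdtemp(), 'mode_demo.txt')\n",
      "\n",
      "with open(mode_demo_path, 'w') as out_file:\n",
      "    print('first version', file=out_file)\n",
      "\n",
      "# opening with 'w' again discards the previous content\n",
      "with open(mode_demo_path, 'w') as out_file:\n",
      "    print('second version', file=out_file)\n",
      "\n",
      "# opening with 'a' keeps the content and adds to the end\n",
      "with open(mode_demo_path, 'a') as out_file:\n",
      "    print('appended line', file=out_file)\n",
      "\n",
      "with open(mode_demo_path, 'r') as in_file:\n",
      "    mode_demo_content = in_file.read()\n",
      "print(mode_demo_content)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },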
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Parsing files\n",
      "Parsing is _\"the process of analyzing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar.\"_ (definition from _Wikipedia_).  \n",
      "More simply, parsing is reading a file in a specific format, 'slurping' the data and storing it in a data structure of your choice (list, dictionary etc.). We can then use this structure to analyze, print or simply view the data in a certain way.  \n",
      "Each file format has its own set of 'rules', and therefore needs to be parsed in a tailored manner. Here we will see an example very relevant for biologists."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### The FASTA format"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. Each sequence has a header. Header lines start with '>'.  \n",
      "The file camelus.fasta includes five sequences of Camelus species. In this parsing example, we'll arrange the data in this file in a dictionary, so that the key is the id number from the header, and the value is the sequence.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.display import FileLink\n",
      "FileLink('lec4_files/camelus.fasta')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We'll start by writing the parsing function."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def parse_fasta(file_name):\n",
      "    \"\"\"\n",
      "    Receives a path to a fasta file, and returns a dictionary where the keys\n",
      "    are the sequence IDs and the values are the sequences.\n",
      "    \"\"\"\n",
      "    # create an empty dictionary to store the sequences\n",
      "    sequences = {}\n",
      "    # open fasta file for reading\n",
      "    with open(file_name,'r') as f:\n",
      "        # Loop over file lines\n",
      "        for line in f:\n",
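      "            # if header line\n",
      "            if line.startswith('>'):\n",
      "                seq_id = line.split('|')[1]\n",
      "                # start a new, empty sequence for this ID\n",
      "                sequences[seq_id] = ''\n",
      "            # if sequence line\n",
      "            else:\n",
      "                # append, so sequences spanning several lines are kept whole\n",
      "                sequences[seq_id] += line.strip()\n",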
      "    return sequences"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
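    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To see why `line.split('|')[1]` picks out the id, here is a sketch on a single made-up header line. The header below is an assumption that follows a `>gi|number|gb|accession|description` layout; the real headers in camelus.fasta may differ in their details, so check the file before relying on the field positions."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# hypothetical header line, written in the assumed gi|...|gb|...| layout\n",
      "example_header = '>gi|0000000|gb|XX000000.1| Example species, partial sequence\\n'\n",
      "\n",
      "header_fields = example_header.split('|')\n",
      "print(header_fields)\n",
      "\n",
      "example_id = header_fields[1]   # second field: the id number\n",
      "example_gb = header_fields[3]   # fourth field: the gb accession\n",
      "print(example_id, example_gb)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },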
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we can use the result. For example, let's print the first 10 nucleotides of every sequence."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "camelus_seq = parse_fasta('lec4_files/camelus.fasta')\n",
      "for seq_id in camelus_seq:\n",
      "    print(seq_id,\" - \",camelus_seq[seq_id][:10])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "![camelus](http://creagrus.home.montereybay.com/Camel_Oman-1.jpg)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## <span style=\"color:blue\">Class exercise 4B</span>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Change the function above so that it takes the gb accession (e.g. EF471324.1) as key and the first 30 nucleotides as value. Then use the output dictionary to print the results to a new file, in the following format:  \n",
      "EF471324.1: AGAGTCTTTGTAGTATATGGATTACGCTGG  \n",
      "EF471323.1: AGAGTCTTTGTAGTATATTGATTACGCTGG  \n",
      ".  \n",
      ".  \n",
      "."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# new function\n",
      "def parse_fasta_30_nuc(file_name):\n",
      "    \"\"\"\n",
      "    Receives a path to a fasta file, and returns a dictionary where the keys\n",
      "    are the sequence gb accession numbers and the values are the first 30\n",
      "    nucleotides of the sequences.\n",
      "    \"\"\"\n",
      "    # create an empty dictionary to store the sequences\n",
      "    sequences = {}\n",
      "    # open fasta file for reading\n",
      "    with open(file_name,'r') as f:\n",
      "        # Loop over file lines\n",
      "        for line in f:\n",
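      "            # if header line\n",
      "            if line.startswith('>'):\n",
      "                gb = line.split('|')[3]\n",
      "                # start a new, empty sequence for this accession\n",
      "                sequences[gb] = ''\n",
      "            # if sequence line\n",
      "            else:\n",
      "                # append the line, then keep only the first 30 nucleotides\n",
      "                sequences[gb] = (sequences[gb] + line.strip())[:30]\n",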
      "    return sequences\n",
      "\n",
      "# parse file\n",
      "camelus_seq = parse_fasta_30_nuc('lec4_files/camelus.fasta')\n",
      "\n",
      "# write to new file\n",
      "with open('lec4_files/4b_output.txt','w') as of:\n",
      "    for gb_id in camelus_seq:\n",
      "        print(gb_id + ':',camelus_seq[gb_id], file=of)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Regular expressions"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Parsing files can sometimes be done using only the string class methods, as we did above. However, sometimes it can get tricky. Let's take an example."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### DNA patterns\n",
      "Suppose we have a DNA sequence in which we want to look for a specific pattern, say, 'TATAGGA'.  \n",
      "What do we do?  \n",
      "Easy, we use the `find` method."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "seq = \"ccgcaattcactctataggagcaggaacatggataaagctcacagtcgca\"\n",
      "if seq.find('tatagga') >= 0:\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "OK, but what if we need to look for a more flexible pattern, such as 'TATAGGN'?  \n",
      "We can do:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if seq.find('tatagga') >= 0 or seq.find('tataggt') >= 0 or seq.find('tataggc') >= 0 or seq.find('tataggg') >= 0:\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "But that's lots of work and also, what if we need 'TATAGNN'?  \n",
      "There are too many combinations to cover manually!  \n",
      "What we need is a more general way of doing such matching. This is what __Regular expressions__ are for!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### What are regular expressions?\n",
      "A regular expression (regex) is a sequence of characters that represents a search pattern. It's like a specific language that was designed to tell us how a text string should look. It includes special symbols which allow us to depict flexible strings.  \n",
      "This is a very powerful tool when looking for patterns or parsing text.  \n",
      "We'll soon see what we can do with it and how to use it."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Using regular expressions\n",
      "In order to use regex, we need to use Python's dedicated built-in module. That means we don't have to install anything, just import the `re` module."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Raw strings\n",
      "We've already encountered some special characters, such as _\\n_ (newline) and _\\t_ (tab).  \n",
      "In regular expressions we want to avoid any confusion, and therefore use a special notation, telling Python that we have no business with special characters here. We simply put an _r_ __outside__ the quotation marks, right before the opening one."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "normal_string = \"There will be\\na new line\"\n",
      "raw_string = r\"There won't be\\na new line\"\n",
      "print(normal_string)\n",
      "print(raw_string)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "__ALWAYS use raw strings when working with regular expressions!__"
     ]
    },
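    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A small sketch of why this matters. In a normal string, Python itself turns `\\b` into a backspace character before the regex engine ever sees it. In a raw string the backslash survives, so `\\b` keeps its regex meaning of 'word boundary'. The example text and variable names are made up for illustration."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "\n",
      "print(len('\\n'))    # 1: a single newline character\n",
      "print(len(r'\\n'))   # 2: a backslash followed by the letter n\n",
      "\n",
      "text_demo = 'the cat sat'\n",
      "raw_match = re.search(r'\\bcat\\b', text_demo)     # \\b reaches the regex engine as a word boundary\n",
      "normal_match = re.search('\\bcat\\b', text_demo)   # Python already turned \\b into a backspace\n",
      "print(raw_match is not None)\n",
      "print(normal_match is not None)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },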
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Searching for patterns\n",
      "This is the most basic task regex is used for. We just want to know if a pattern can be found within a string.  \n",
      "The first step when working with regex is _compiling_. This means we transform a simple string such as 'tatagga' into a regex pattern. This is done using `re.compile()`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'tatagga') # notice the 'r'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We didn't match anything yet, just prepared the regex pattern. Once we have it, we can use it to search within another string. For this we can use the `re.search()` function. It takes two parameters: the regex and the string to search (the 'target string'). It returns a match object if the pattern was found and _None_ otherwise. A match object counts as true in an `if` statement and _None_ counts as false, so we can test the result directly."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,seq):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
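    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "When the pattern is found, `re.search()` returns a match object, so we can also ask what text was matched and where. A minimal sketch, using the same sequence as before (the variable names ending in `_demo` are introduced only for this cell):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "\n",
      "seq_demo = 'ccgcaattcactctataggagcaggaacatggataaagctcacagtcgca'\n",
      "match_demo = re.search(r'tatagga', seq_demo)\n",
      "if match_demo:\n",
      "    print(match_demo.group())   # the text that was matched\n",
      "    print(match_demo.start())   # index where the match starts\n",
      "    print(match_demo.end())     # index just after the match ends"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },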
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Character groups\n",
      "The last example wasn't particularly useful, right?  \n",
      "OK, so here's when it gets interesting. We can define character groups within our regex, so that any of them will be matched. We do that using square brackets, and put all possible matches within them. So if we want to match 'TATAGGN' we'll do:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'tatagg[atgc]')\n",
      "if re.search(regex,seq):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,\"tataggn\"):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can put any list of characters within the brackets. There are also a few tricks to make things easier:  \n",
      "* [0-9] - any digit\n",
      "* [a-z] - any lowercase letter\n",
      "* [a-p] - any lowercase letter between a and p\n",
      "  \n",
      "There are also special symbols for common groups:  \n",
      "* \\d - any digit (equivalent to [0-9])\n",
      "* \\w - any 'word' character - letters, digits and underscore (equivalent to [a-zA-Z0-9\\_])\n",
      "* \\s - any whitespace character - space, tab, newline and other weird stuff (equivalent to [ \\t\\n\\r\\f\\v])\n",
      "  \n",
      "And finally, there's the _wildcard_ symbol, represented by a dot (.).  \n",
      "This means any character (except for a newline).  \n",
      "__Careful with this one!__ It'll take almost anything, so use it wisely."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# examples:\n",
      "regex = re.compile(r'\\d[d-k][2-8].')\n",
      "if re.search(regex,'7f6,'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'hello7f6world'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'5l7o'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'7f6'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Being negative\n",
      "Sometimes we want to tell python to search for 'anything but...'. We can do that in two ways:  \n",
      "If we are using character groups in square brackets, we can simply add a caret (^) as the first character inside the brackets. For example, `[^gnp%]` means 'match anything but g, n, p or %'. If we are using the special character groups, we can replace the symbol with a capital letter, so for example \\D means 'match anything but a digit'."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'AAT[^G]TAA')\n",
      "if re.search(regex,'AATCTAA'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'AATGTAA'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'AAT\\STAA')\n",
      "if re.search(regex,'AATCTAA'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'AATGTAA'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'AAT TAA'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Alternation\n",
      "When we want to create multiple options for longer patterns, character groups are not enough. In these cases we have to use the special '|' (pipe) character, which simply means 'or'.  \n",
      "For example, if we want to match a pattern that starts with AGG, then either CCG __or__ TAG, and finally GTG, we can do: "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'AGG(CCG|TAG)GTG')\n",
      "if re.search(regex,'AGGTAGGTG'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'AGGCCGGTG'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'AGGCCTGTG'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Repetition\n",
      "In many cases, we want to write regular expressions where a part of the pattern repeats itself multiple times. For that, we use _quantifiers_.  \n",
      "If we know exactly how many repetitions we want, we can use `{<number>}`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'GA{5}T')\n",
      "if re.search(regex,'GAAAAAT'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'GAAAT'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can also set an acceptable range for the number of repeats, using `{<minimum repeats>,<maximum repeats>}`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'GA{3,5}T')\n",
      "if re.search(regex,'GAAAAAT'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'GAAAT'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'GAAAAAAT'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To say 'x or more repetitions', we use `{x,}`. For 'up to x repetitions', we can use `{0,x}`."
     ]
    },
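    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A short sketch of these open-ended forms, reusing the G...T toy pattern from the previous cells (the test strings and variable names are made up for illustration). For each string it prints whether each of the two patterns was found:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "at_least_3 = re.compile(r'GA{3,}T')   # G, then three or more A's, then T\n",
      "up_to_2 = re.compile(r'GA{0,2}T')     # G, then zero, one or two A's, then T\n",
      "for s in ['GT', 'GAAT', 'GAAAAAAT']:\n",
      "    print(s, bool(re.search(at_least_3, s)), bool(re.search(up_to_2, s)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },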
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For more general cases, there are three special symbols we can use:  \n",
      "- \\+ - repeat 1 or more times\n",
      "- \\* - repeat 0 or more times\n",
      "- ? - repeat 0 or 1 times, or in other words 'optional' character."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'GA+TT?[AC]*')\n",
      "if re.search(regex,'GAATTACCA'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'GATACCA'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'GTACCA'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'GAAAAAAAT'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "__Note 1__: Quantifiers always refer to the character that appears right before them. This could be a normal character or a character group. If we want to indicate a repeat of several characters, we enclose them in ()."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'GGCG(AT)+GGG')\n",
      "if re.search(regex,'GGCGATATATATGGG'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'GGCGATTAATGGG'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'GGCG(AT)?GGG')\n",
      "if re.search(regex,'GGCGATGGG'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'GGCGGGG'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "__Note 2__: Whenever we want to match one of the special regex characters in its 'normal' context, we simply put a '\\' before it. For example: \\\\*, \\\\+, \\\\{..."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'.+\\{\\d+\\}\\.')\n",
      "sentence = 'A sentence that ends with number in curly brackets {345}.'\n",
      "if re.search(regex,sentence):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## <span style=\"color:blue\">Class exercise 4D</span>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The code below includes a list of made-up gene names. Complete it to only print gene names that satisfy the following criteria:  \n",
      "1. Contain the letter 'd' __or__ 'e'  \n",
      "2. Contain the letter 'd' __and__ 'e', in that order (not necessarily in a row)\n",
      "3. Contain three or more digits in a row"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "genes = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847', 'hedle3455', 'xjhd53e', '45da', 'de37dp','map492ty']\n",
      "\n",
      "# 1.\n",
      "print('Gene names containing d or e:')\n",
      "regex1 = re.compile(r'[de]')\n",
      "for gene in genes:\n",
      "    if re.search(regex1,gene):\n",
      "        print(gene)\n",
      "        \n",
      "print('------------------------')\n",
      "\n",
      "# 2.\n",
      "print('Gene names containing d and e, in that order:')\n",
      "regex2 = re.compile(r'd[^e]*e')\n",
      "for gene in genes:\n",
      "    if re.search(regex2,gene):\n",
      "        print(gene)\n",
      "        \n",
      "print('------------------------')\n",
      "\n",
      "# 3.\n",
      "print('Gene names containing three digits in a row:')\n",
      "regex3 = re.compile(r'\\d{3,}')\n",
      "for gene in genes:\n",
      "    if re.search(regex3,gene):\n",
      "        print(gene)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Enforcing positions\n",
      "We can force a regex to match only at the start or at the end of the input string. We do that by using the ^ and $ symbols, respectively."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'^my name')\n",
      "if re.search(regex,'my name is Slim Shady'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'This is my name'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'my name$')\n",
      "if re.search(regex,'This is my name'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can combine the start and end symbols to match a whole string:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regex = re.compile(r'^GC[GTC]{2,10}TTA$')\n",
      "if re.search(regex,'GCTTCGCTTA'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "if re.search(regex,'GCTTCGCTTAG'):\n",
      "    print('pattern found!')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Extracting matches\n",
      "OK, now that we know the 'language' of regular expressions, let's see another useful thing we can do with it.  \n",
      "So far, we only used regex to test if a string matches a pattern, but sometimes we also want to extract parts of the string for later use.  \n",
      "Let's take an example."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### The GATA-4 Transcription factor\n",
      "GATA-4 is a TF in humans, known to have an important role in cardiac development (Oka, T., Maillet, M., Watt, A. J., Schwartz, R. J., Aronow, B. J., Duncan, S. A., & Molkentin, J. D. (2006). Cardiac-specific deletion of Gata4 reveals its requirement for hypertrophy, compensation, and myocyte viability. Circulation research, 98(6), 837-845.)  \n",
      "It is also known to bind the motif: AGATADMAGRSA (where M = A or C, D = A,G or T, R = A or G and S = C or G).  \n",
      "Using regex, it's easy to write a function that checks if a sequence includes this motif.\n",
      "![Motif](lec4_files/gata4.jpg)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def check_for_GATA4(sequence):\n",
      "    motif_regex = re.compile(r'AGATA[AGT][AC]AG[AG][CG]A')   # D = A, G or T\n",
      "    if re.search(motif_regex,sequence):\n",
      "        return True\n",
      "    else:\n",
      "        return False"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "test_seq1 = 'AGAGTCTTTGAGATAGCAGACATAGTATATGGATTACGCTGGTCTTGTAAACCATAAAAGGAGAGCCACACTCTCCCTAAGACTCAGGGAAGAGGCCAAAGCCCCACCACCAGCACCCAAAGCTG'\n",
      "check_for_GATA4(test_seq1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "test_seq2 = 'AGAGTCTTTGAGATAGTAGACATAGTATATGGATTACGCTGGTCTTGTAAACCATAAAAGGAGAGCCACACTCTCCCTAAGACTCAGGGAAGAGGCCAAAGCCCCACCACCAGCACCCAAAGCTG'\n",
      "check_for_GATA4(test_seq2)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "But what if we want to extract the actual sequence that matches the regex?  \n",
      "Let's have another look at the `re.search()` method. So far, we only used it to test if a match exists or not. But it actually returns something, which we can use to get the exact match, with the `group()` method.  \n",
      "This method is used on the search result to get the match. So the following function will return the actual match in the sequence, if one exists. Otherwise, it will return `None`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def find_GATA4_motif(sequence):\n",
      "    motif_regex = re.compile(r'AGATA[AGT][AC]AG[AG][CG]A')   # D = A, G or T\n",
      "    result = re.search(motif_regex,sequence)   # notice the assignment here\n",
      "    if result is None:\n",
      "        return None\n",
      "    else:\n",
      "        return result.group()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(find_GATA4_motif(test_seq1))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(find_GATA4_motif(test_seq2))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Since most of the motif is fixed, we might only be interested in the 'ambiguous' parts (that is, the DM part and the RS part). We can _capture_ specific parts of the pattern by enclosing them with parentheses. Then we can extract them by giving the `group()` method an argument, where '1' means 'extract the first captured part', '2' means 'extract the second captured part' and so on. The following function will capture the ambiguous positions and return them as elements of a list."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def extract_ambiguous_for_GATA4(sequence):\n",
      "    motif_regex = re.compile(r'AGATA([AGT])([AC])AG([AG])([CG])A') # notice the parentheses\n",
      "    result = re.search(motif_regex,sequence)\n",
      "    if result is None:\n",
      "        return None\n",
      "    else:\n",
      "        D = result.group(1)\n",
      "        M = result.group(2)\n",
      "        R = result.group(3)\n",
      "        S = result.group(4)\n",
      "        return [D,M,R,S]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "D,M,R,S = extract_ambiguous_for_GATA4(test_seq1)\n",
      "print('D nucleotide:',D)\n",
      "print('M nucleotide:',M)\n",
      "print('R nucleotide:',R)\n",
      "print('S nucleotide:',S)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
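    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a side note, the match object also has a `groups()` method (part of the standard `re` module), which returns all the captured parts at once as a tuple. A minimal sketch, using a short made-up sequence and the motif as described in the text (D = A, G or T):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "motif_regex = re.compile(r'AGATA([AGT])([AC])AG([AG])([CG])A')\n",
      "result = re.search(motif_regex, 'TTTAGATAGCAGACATTT')   # made-up sequence, for illustration only\n",
      "if result is not None:\n",
      "    print(result.groups())   # all four captured nucleotides, as a tuple"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },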
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### More on regular expressions\n",
      "There are some other cool things we can do with regex, which we won't discuss in detail here:\n",
      "* Split strings by regex\n",
      "* Substitute parts of a string using regex\n",
      "* Get the position in the string where a pattern was found\n",
      "\n",
      "If you want to do any of these, take a look at the [re module documentation](https://docs.python.org/3/library/re.html)."
     ]
    },
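    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For a first taste, here is a rough sketch of these three operations, using `re.split()`, `re.sub()` and the match object's `start()` and `end()` methods, all from the standard `re` module. The strings are made up for illustration; see the documentation for the full details."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "# split a string wherever one or more digits appear\n",
      "print(re.split(r'\\d+', 'abc12def345ghi'))\n",
      "# substitute every run of A's with a single lowercase 'a'\n",
      "print(re.sub(r'A+', 'a', 'GAAATTAAC'))\n",
      "# get the position where a pattern was found\n",
      "result = re.search(r'TATA', 'GGCTATAAGG')\n",
      "if result is not None:\n",
      "    print(result.start(), result.end())   # index of match start, and index right after its end"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },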
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Recommended:\n",
      "The Regex Coach is a very useful program when dealing with more complex patterns. It lets you try your regular expressions interactively, see if they work and what parts are extracted. Download and more information [here](http://www.weitz.de/regex-coach/#install)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## <span style=\"color:blue\">Class exercise 4E</span>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The 'GATA4_promoters.fasta' file includes (made-up) promoter sequences for genes suspected to be regulated by GATA-4.  \n",
      "We'll use everything we've learned so far to write a program that summarizes some interesting statistics regarding the GATA-4 motifs in these promoters.  \n",
      "First, let's adjust the parse\\_fasta() function we created earlier for the specific format of the promoters file:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def parse_promoters_fasta(file_name):\n",
      "    \"\"\"\n",
      "    Receives a path to a fasta file, and returns a dictionary where the keys\n",
      "    are the sequence names and the values are the sequences.\n",
      "    \"\"\"\n",
      "    # create an empty dictionary to store the sequences\n",
      "    sequences = {}\n",
      "    # open fasta file for reading\n",
      "    with open(file_name,'r') as f:\n",
      "        # Loop over file lines\n",
      "        for line in f:\n",
      "            # if header line\n",
      "            if line.startswith('>'):\n",
      "                seq_id = line[1:-1]   # take the whole line, except the '>' in the beginning and '\\n' at the end\n",
      "            # if sequence line\n",
      "            else:\n",
      "                seq = line.strip()\n",
      "                sequences[seq_id] = seq\n",
      "    return sequences"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "1)\n",
      "Write a function that receives a promoters fasta dictionary, and counts how many of the promoters have the GATA-4 motif. Use any of the functions defined above and complete the code:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def count_promoters_with_motif(promoters_dictionary):\n",
      "    \"\"\"\n",
      "    Receives a dictionary representing a promoters fasta file,\n",
      "    and counts how many of the promoters include a GATA-4 motif.\n",
      "    \"\"\"\n",
      "    promoters_count = 0   # store the number of promoters with GATA-4 motif\n",
      "    for p in promoters_dictionary:\n",
      "        if check_for_GATA4(promoters_dictionary[p]):\n",
      "            promoters_count += 1\n",
      "    return promoters_count"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "2) For promoters that do include the GATA-4 motif, we would like to know the frequencies of the different nucleotides for each of the four variable positions in the motif. Complete the code:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def get_positions_statistics(promoters_dictionary):\n",
      "    \"\"\"\n",
      "    Receives a dictionary representing a promoters fasta file,\n",
      "    and returns the frequencies of possible nucleotides in \n",
      "    each variable position.\n",
      "    \"\"\"\n",
      "    # define a dictionary for each position, to store the nucleotide frequencies\n",
      "    # D position\n",
      "    D_dict = {'A':0, 'G':0, 'T':0}\n",
      "    # M position\n",
      "    M_dict = {'A':0, 'C':0}\n",
      "    # R position\n",
      "    R_dict = {'A':0, 'G':0}\n",
      "    # S position\n",
      "    S_dict = {'C':0, 'G':0}\n",
      "    \n",
      "    # iterate over promoters\n",
      "    for p in promoters_dictionary:\n",
      "        # if promoter includes the GATA-4 motif\n",
      "        if check_for_GATA4(promoters_dictionary[p]):\n",
      "            # get variable nucleotides in promoter\n",
      "            D,M,R,S = extract_ambiguous_for_GATA4(promoters_dictionary[p])\n",
      "            # insert to dictionaries\n",
      "            D_dict[D] += 1\n",
      "            M_dict[M] += 1\n",
      "            R_dict[R] += 1\n",
      "            S_dict[S] += 1\n",
      "            \n",
      "    return D_dict, M_dict, R_dict, S_dict"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "3) Now, we just have to write a function that will summarize the results in a CSV file. It should receive the frequencies dictionaries and write statistics to an output file. Complete the code:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def summarize_results(D_dict, M_dict, R_dict, S_dict, output_file):\n",
      "    with open(output_file, 'w', newline='') as fo:   # newline='' as recommended by the csv module docs\n",
      "        csv_writer = csv.writer(fo)\n",
      "        # write headers line\n",
      "        csv_writer.writerow(['Position','A','G','C','T'])\n",
      "        # summarize D position\n",
      "        csv_writer.writerow(['D',D_dict['A'],D_dict['G'],0,D_dict['T']])\n",
      "        # summarize M position\n",
      "        csv_writer.writerow(['M',M_dict['A'],0,M_dict['C'],0])\n",
      "        # summarize R position\n",
      "        csv_writer.writerow(['R',R_dict['A'],R_dict['G'],0,0])\n",
      "        # summarize S position\n",
      "        csv_writer.writerow(['S',0,S_dict['G'],S_dict['C'],0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "4) Now that we have all the functions ready, we can write the main program. Complete the code:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import csv\n",
      "promoters_file = \"lec4_files/GATA4_promoters.fasta\"\n",
      "output_file = \"lec4_files/promoters_stats.csv\"\n",
      "\n",
      "# parse fasta file\n",
      "promoters_dict = parse_promoters_fasta(promoters_file)\n",
      "\n",
      "# Count promoters with/without GATA-4 motif\n",
      "promoters_with_motif = count_promoters_with_motif(promoters_dict)\n",
      "promoters_without_motif = len(promoters_dict) - promoters_with_motif\n",
      "print('Total promoters:',promoters_with_motif + promoters_without_motif)\n",
      "print('Promoters with GATA-4 motif:',promoters_with_motif)\n",
      "print('Promoters without GATA-4 motif:',promoters_without_motif)\n",
      "\n",
      "# Get statistics\n",
      "D_dict, M_dict, R_dict, S_dict = get_positions_statistics(promoters_dict)\n",
      "# write to CSV\n",
      "summarize_results(D_dict, M_dict, R_dict, S_dict,output_file)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### The CSV format\n",
      "Comma separated values (CSV) is a very common and useful format for storing tabular data. It is similar to an Excel file, only it is completely text based. Let's have a look at an example file, both using Excel and a simple text editor."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can, quite easily, create our own functions for dealing with CSV files, for example by splitting each line by commas. However, Python has a built-in module for exactly this purpose, so why bother?"
     ]
    },
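    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One reason to prefer the module: a value in a CSV file may itself contain a comma, in which case it is wrapped in quotes. A minimal sketch with a made-up line (passed to `csv.reader` inside a list rather than read from a file, which the reader also accepts):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import csv\n",
      "line = 'Ge-0,\"Geneva, Switzerland\",7.34'   # made-up line, for illustration only\n",
      "print(line.split(','))             # naive splitting breaks the quoted value in two\n",
      "print(next(csv.reader([line])))    # the csv module keeps it as a single value"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },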
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Reading CSV files\n",
      "The simplest way to read a CSV file is to use the module's `reader` function. This function receives a file object (created with `open()`) and returns a reader object."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import csv"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "experiments_file = 'lec4_files/electrolyte_leakage.csv'\n",
      "with open(experiments_file, 'r') as f:\n",
      "    csv_reader = csv.reader(f)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Once we have defined the csv reader, we can use it to iterate over the file lines. Each row is returned as a list of the column values."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "experiments_file = 'lec4_files/electrolyte_leakage.csv'\n",
      "with open(experiments_file, 'r') as f:\n",
      "    csv_reader = csv.reader(f)\n",
      "    for row in csv_reader:\n",
      "        print(row[0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Writing CSV files\n",
      "Writing is also rather straightforward. The csv module supplies the `csv.writer` object, which has the method `writerow()`. This method receives a list and writes it to the file as a CSV line."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "new_file = 'lec4_files/out_csv.csv'\n",
      "with open(new_file, 'w', newline='') as fo:    # notice the 'w' instead of 'r'\n",
      "    csv_writer = csv.writer(fo)\n",
      "    csv_writer.writerow(['these','are','the','column','headers'])\n",
      "    csv_writer.writerow(['and','these','are','the','values'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## <span style=\"color:blue\">Class exercise 4C</span>\n",
      "The electrolyte_leakage.csv file contains the results of experiments on different Arabidopsis ecotypes (accessions). In each row, there are 3 control plants and 3 plants tested under drought stress.  \n",
      "Read the CSV file, calculate the mean result for control and for test plants of each ecotype, and print the result as a new CSV file, in the following way:  \n",
      "  \n",
      "Accession  |  control mean  |  test mean  \n",
      "101AV/Ge-0 |      7.34      |     3.03  \n",
      "157AV/Ita-0|     16.85      |     2.92  \n",
      ".  \n",
      ".  \n",
      ".  \n",
      "  \n",
      "Use the provided accessory function to calculate means.  \n",
      "Try opening the output file in Excel to make sure your code works properly."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from statistics import mean   # mean() from the standard statistics module (Python 3.4+)\n",
      "\n",
      "def mean_of_string_values(lst):\n",
      "    \"\"\"\n",
      "    receives a list of strings representing numbers and returns their mean\n",
      "    \"\"\"\n",
      "    numeric_lst = []\n",
      "    for x in lst:\n",
      "        numeric_lst.append(float(x))\n",
      "    return mean(numeric_lst)\n",
      "\n",
      "experiments_file = 'lec4_files/electrolyte_leakage.csv'\n",
      "with open(experiments_file, 'r') as f:\n",
      "    with open('lec4_files/4c_output.csv','w', newline='') as fo:\n",
      "        csv_writer = csv.writer(fo)\n",
      "        csv_writer.writerow(['Accession','control mean','test mean'])\n",
      "        csv_reader = csv.reader(f)\n",
      "        next(csv_reader)\n",
      "        for row in csv_reader:\n",
      "            acc = row[0]\n",
      "            control = row[1:4]\n",
      "            test = row[4:]\n",
      "            to_write = [acc,mean_of_string_values(control),mean_of_string_values(test)]\n",
      "            csv_writer.writerow(to_write)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Fin\n",
      "This notebook is part of the _Python Programming for Life Sciences Graduate Students_ course given at Tel-Aviv University, Spring 2015.\n",
      "\n",
      "The notebook was written using [Python](http://python.org/) 3.4.1 and [IPython](http://ipython.org/) 2.1.0 (download from [PyZo](http://www.pyzo.org/downloads.html)).\n",
      "\n",
      "The code is available at https://github.com/Py4Life/TAU2015/blob/master/lecture4.ipynb.\n",
      "\n",
      "The notebook can be viewed online at http://nbviewer.ipython.org/github/Py4Life/TAU2015/blob/master/lecture4.ipynb.\n",
      "\n",
      "The notebook is also available as a PDF at https://github.com/Py4Life/TAU2015/blob/master/lecture4.pdf?raw=true.\n",
      "\n",
      "This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.\n",
      "\n",
      "![Python logo](https://www.python.org/static/community_logos/python-logo.png)"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}