{
 "metadata": {
  "name": "",
  "signature": "sha256:2e8c6fcca04def117ab8b5fe176232282ec9f277bd493025b345f6b1326b0709"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## [Python Programming for Biologists, Tel-Aviv University / 0411-3122 / Spring 2015](http://py4life.github.io/TAU2015/)\n",
      "# Homework 3\n",
      "\n",
      "## 1) Translating DNA\n",
      "In the code below there is a dictionary (named `codon_table`) in which keys represent codons and values represent corresponding amino acids. \n",
      "\n",
      "Write a program that will translate a DNA sequence into an amino acid sequence using the codons disctionary. Print out the result. Note that `*` are stop codons.  \n",
      "\n",
      "If you want to know more about how the codons dictionary was created, read the documentation for [list comprehension](https://docs.python.org/3.4/tutorial/datastructures.html#list-comprehensions) and the built-in [zip-function](https://docs.python.org/3.4/library/functions.html#zip)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Create codons dictionary\n",
      "bases = ['t', 'c', 'a', 'g']\n",
      "codons = [a+b+c for a in bases for b in bases for c in bases]\n",
      "amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'\n",
      "codon_table = dict(zip(codons, amino_acids))\n",
      "\n",
      "DNA = 'atgattccaacgcgaaggtcaagtacgtacagctctcagtgtgtgctactcaccgactccgtcatagcaaccggcgtcgtggtcgttaccattgcataa'\n",
      "\n",
      "# translate the sequence\n",
      "prot = \"\"\n",
      "for i in range(0,len(DNA) - 2,3):\n",
      "    codon = DNA[i:i+3]\n",
      "    amino_acid = codon_table[codon]\n",
      "    prot = prot + amino_acid\n",
      "\n",
      "print(prot)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "MIPTRRSSTYSSQCVLLTDSVIATGVVVVTIA*\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## 2) Protein contents\n",
      "\n",
      "**a)** Write a function that __receives__ an amino acid sequence as string and __returns__ a dictionary where the keys are the amino acid residues and the values are the number of times each residue appeared in the protein. For example, the expected result for the peptide `LLTDSGT` is: `{'L': 2, 'T': 2, 'D': 1, 'S': 1, 'G': 1}`.  \n",
      "Test your function on the provided sequences, and print the results in the following format:  \n",
      "L - 2  \n",
      "T - 2  \n",
      "D - 1  \n",
      "S - 1  \n",
      "G - 1\n",
      "\n",
      "Remember: `dict` is unordered."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def count_residues(protein_seq):\n",
      "    counts_dict = {}   # create an empty dictionary to store results\n",
      "    for aa in protein_seq:\n",
      "        # if amino acid has already been added to the dictionary, increase its count by 1\n",
      "        if aa in counts_dict:\n",
      "            counts_dict[aa] += 1\n",
      "        # if amino acid found for the first time, add it with count 1\n",
      "        else:\n",
      "            counts_dict[aa] = 1\n",
      "    return counts_dict\n",
      "    \n",
      "protein_sequence = 'DQHTWMYAEGYLNHVYRCDKQRAEDKECNGLYAWALALESHGKGSYYCQGFKTFPNPWPMHMMTFVMADLYQYMEI'\n",
      "aa_counts_dict = count_residues(protein_sequence)\n",
      "# print results\n",
      "for aa in aa_counts_dict:\n",
      "    print(aa,'-',aa_counts_dict[aa])\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "F - 3\n",
        "W - 3\n",
        "H - 4\n",
        "I - 1\n",
        "K - 4\n",
        "Q - 4\n",
        "P - 3\n",
        "L - 5\n",
        "G - 5\n",
        "N - 3\n",
        "A - 6\n",
        "E - 5\n",
        "C - 3\n",
        "D - 4\n",
        "Y - 8\n",
        "V - 2\n",
        "R - 2\n",
        "S - 2\n",
        "T - 3\n",
        "M - 6\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**b)** Write a function that receives an amino acid sequence as a string and returns a dictionary with the _frequencies_ of hydrophobic, posituvely-charged, negatively-charged, polar an other amino acids. Use the strings provided in the code below. \n",
      "\n",
      "For example,\n",
      "```\n",
      "residues_type_frequencies('LLTDSGT')\n",
      "{'hydrophobic': 0.286, 'positive': 0, 'negative': 0.143, 'polar': 0.428, 'other': 0.143}\n",
      "```\n",
      "\n",
      "Test your function on the provided amino acid sequence, and print the results in the following format:\n",
      "```\n",
      "hydrophobic - 0.286  \n",
      "positive - 0  \n",
      "negative - 0.143  \n",
      "polar - 0.428  \n",
      "other - 0.143\n",
      "```"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def residues_type_frequencies(protein_seq):\n",
      "    hydrophobic = ['A','V','I','L','M','F','Y','W']\n",
      "    pos_charged = ['R','H','K']\n",
      "    neg_charged = ['D','E']\n",
      "    polar = ['S','T','N','Q']\n",
      "    other = ['C','U','G','P']\n",
      "    \n",
      "    prot_length = len(protein_seq)\n",
      "    aa_types_dict = {'hydrophobic': 0, 'positive': 0, 'negative': 0, 'polar': 0, 'other': 0}\n",
      "    # first, count amino acids in each category\n",
      "    for aa in protein_seq:\n",
      "        if aa in hydrophobic:\n",
      "            aa_types_dict['hydrophobic'] += 1\n",
      "        elif aa in pos_charged:\n",
      "            aa_types_dict['positive'] += 1\n",
      "        elif aa in neg_charged:\n",
      "            aa_types_dict['negative'] += 1\n",
      "        elif aa in polar:\n",
      "            aa_types_dict['polar'] += 1\n",
      "        elif aa in other:\n",
      "            aa_types_dict['other'] += 1\n",
      "    # now, convert to frequencies\n",
      "    for aa_type in aa_types_dict:\n",
      "        aa_types_dict[aa_type] = aa_types_dict[aa_type]/prot_length\n",
      "    return aa_types_dict\n",
      "    \n",
      "    \n",
      "protein_sequence = 'DQHTWMYAEGYLNHVYRCDKQRAEDKECNGLYAWALALESHGKGSYYCQGFKTFPNPWPMHMMTFVMADLYQYMEI'\n",
      "aa_types_freq_dict = residues_type_frequencies(protein_sequence)\n",
      "# print results\n",
      "for aa_type in aa_types_freq_dict:\n",
      "    print(aa_type,'-',aa_types_freq_dict[aa_type])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "other - 0.14473684210526316\n",
        "polar - 0.15789473684210525\n",
        "negative - 0.11842105263157894\n",
        "positive - 0.13157894736842105\n",
        "hydrophobic - 0.4473684210526316\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## 3) Palindromic sequences\n",
      "A palindromic sequence is a DNA sequence that is the same whether read 5' to 3' on one strand or 5' to 3' on the complementary strand. For example, the sequence 5' GAATTC 3' is palindromic, since the complement strand is 3' CTTAAG 5', or 5' GAATTC 3'.   \n",
      "Palindromic sequences are biologically interesting because they can form special structural motifs, such as hairpins, and often are cutting sites for restriction enzymes.  \n",
      "\n",
      "**a)** Write a function `is_palindrome` that receives a DNA sequence as a string and returns `True` if it is palindromic and `False` otherwise. You may use the function defined in the lecture to find the complement strand. The assertions test your function on the provided sequences."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def complement(sequence):\n",
      "    transcript_dict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}\n",
      "    complement = ''\n",
      "    for base in sequence:\n",
      "        complement += transcript_dict[base]\n",
      "    return complement\n",
      "\n",
      "def reverse_sequence(sequence):\n",
      "    reversed_seq = ''\n",
      "    seq_as_list = list(sequence)\n",
      "    for base in reversed(seq_as_list):\n",
      "        reversed_seq += base\n",
      "    return reversed_seq\n",
      "\n",
      "def reverse_complement(sequence):\n",
      "    complement_seq = complement(sequence)\n",
      "    reverse_complement = reverse_sequence(complement_seq)\n",
      "    return reverse_complement\n",
      "\n",
      "\n",
      "def is_palindrome(seq):\n",
      "    if seq == reverse_complement(seq):\n",
      "        return True\n",
      "    else:\n",
      "        return False\n",
      "    \n",
      "assert(is_palindrome('GAATTC'))\n",
      "assert(is_palindrome('GATATC'))\n",
      "assert(is_palindrome('AGCTTCTAGTCGACTAGAAGCT'))\n",
      "assert(not is_palindrome('GAACTC'))\n",
      "assert(not is_palindrome('GATATG'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**b)** Now use the function `is_palindrome` to look for palindromic subsequences within a given DNA sequence.  \n",
      "\n",
      "Write a function `find_palindromes` that receives two parameters: a sequence `seq` (string) and an integer `n`. The function searches `seq` for n bases long palindromic subsequences. It returns a list of all the palindromic subsequences found. If none were found, it return an empty list. Implement the function using a __for__ loop."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def find_palindromes(S, n):\n",
      "    palindromes = []\n",
      "    i = 0\n",
      "    for i in range(0,len(S)-n+1):\n",
      "        subseq = S[i:i+n]\n",
      "\n",
      "        if is_palindrome(subseq):\n",
      "            palindromes.append(subseq) \n",
      "            \n",
      "    return palindromes\n",
      "\n",
      "DNA_seq = 'GGAGCTCCCAAAGCCATCAATATTCATCAAAACGAATTCAACGGAGCTCGATATCGCATCGCAAAAGACACC'\n",
      "palindromic_sequences = find_palindromes(DNA_seq,6)\n",
      "assert palindromic_sequences == ['GAGCTC', 'AATATT', 'GAATTC', 'GAGCTC', 'GATATC']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**c)** Implement the same function using a __while__ loop."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def find_palindromes(S, n):\n",
      "    palindromes = []\n",
      "    i = 0\n",
      "    while i < len(S)-n+1:\n",
      "        subseq = S[i:i+n]\n",
      "        if is_palindrome(subseq):\n",
      "            palindromes.append(subseq) \n",
      "            \n",
      "        i += 1     \n",
      "                       \n",
      "    return palindromes\n",
      "\n",
      "DNA_seq = 'GGAGCTCCCAAAGCCATCAATATTCATCAAAACGAATTCAACGGAGCTCGATATCGCATCGCAAAAGACACC'\n",
      "palindromic_sequences = find_palindromes(DNA_seq,6)\n",
      "assert palindromic_sequences == ['GAGCTC', 'AATATT', 'GAATTC', 'GAGCTC', 'GATATC']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A major caveat of the functions created so far is that they will return all palindromic sequences, even if they are overlapping, which makes no biological sense. For example, if we search the sequence `GAATTCGAACAT` for 6-bases long palindromes, we will get both `GAATTC` and `TTCGAA`, although they are overlapping.  \n",
      "\n",
      "**d)** Choose one of the implementations from parts **b** and **c**, and change it so that no overlapping palindromes will be found. The function should return the upstream palindromes where there is an overlap."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def find_palindromes_no_overlap(S, n):\n",
      "    palindromes = []\n",
      "    i = 0\n",
      "    while i < len(S)-n+1:\n",
      "        subseq = S[i:i+n]\n",
      "        if is_palindrome(subseq):\n",
      "            palindromes.append(subseq) \n",
      "            i += n\n",
      "        else:    \n",
      "             i += 1\n",
      "                       \n",
      "    return palindromes \n",
      "  \n",
      "# test\n",
      "DNA_seq = 'GGAGCTCCCAAAGCCATCAATATTCATCAAAACGAATTCAACGGAGCTCGATATCGCATCGCAAAAGACACC'\n",
      "palindromic_sequences = find_palindromes_no_overlap(DNA_seq,6)\n",
      "assert palindromic_sequences == ['GAGCTC', 'AATATT', 'GAATTC', 'GAGCTC', 'GATATC']\n",
      "DNA_seq = 'GGAGCTCCCAAAGCCATCAGAATTCGAACATATCGCAAAAGACACC'\n",
      "palindromic_sequences = find_palindromes(DNA_seq,6)\n",
      "assert palindromic_sequences == ['GAGCTC', 'GAATTC', 'TTCGAA']\n",
      "DNA_seq = 'GGAGCTCCCAAAGCCATCAGAATTCGAACATATCGCAAAAGACACC'\n",
      "palindromic_sequences = find_palindromes_no_overlap(DNA_seq,6)\n",
      "assert palindromic_sequences == ['GAGCTC', 'GAATTC']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    }
   ],
   "metadata": {}
  }
 ]
}