{
 "metadata": {
  "name": "",
  "signature": "sha256:f9a1bc5356d9952069906efcfbb78df69d2832ccdcea6778b345ee99c2401fcc"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This illustrates how to use global alignment to find a short sequence (e.g., a PCR primer) in a longer sequence (e.g., a 454 read) using Needleman-Wunsch alignment."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from skbio.alignment import global_pairwise_align_nucleotide"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "s = \"ACCGTGGACCGTTAGGATTGGACCCAAGGTTG\"\n",
      "t = \"T\"*25 + \"ACCGTGGACCGTAGGATTGGACCAAGGTTA\" + \"A\"*25\n",
      "\n",
      "a = global_pairwise_align_nucleotide(s, t)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "/Users/caporaso/Dropbox/code/scikit-bio/skbio/alignment/_pairwise.py:594: UserWarning: make_identity_substitution_matrix is deprecated and will soon be replaced, though at the time of this writing the new name has not been finalized. Updates will be posted to issue #161: https://github.com/biocore/scikit-bio/issues/161\n",
        "  warn(\"make_identity_substitution_matrix is deprecated and will soon be \"\n",
        "/Users/caporaso/Dropbox/code/scikit-bio/skbio/alignment/_pairwise.py:540: EfficiencyWarning: You're using skbio's python implementation of Needleman-Wunsch alignment. This is known to be very slow (e.g., thousands of times slower than a native C implementation). We'll be adding a faster version soon (see https://github.com/biocore/scikit-bio/issues/254 to track progress on this).\n",
        "  \"to track progress on this).\", EfficiencyWarning)\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print a.to_fasta()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ">0\n",
        "-------------------------ACCGTGGACCGTTAGGATTGGACCCAAGGTTG-------------------------\n",
        ">1\n",
        "TTTTTTTTTTTTTTTTTTTTTTTTTACCGTGGACCGT-AGGATTGGACC-AAGGTTAAAAAAAAAAAAAAAAAAAAAAAAAA\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can then slice just the targeted region as follows. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "gap_vector = a[0].gap_vector()\n",
      "start_index = gap_vector.index(False)\n",
      "end_index = (a.sequence_length() - 1) - gap_vector[::-1].index(False)\n",
      "print a[0][start_index:end_index+1]\n",
      "print a[1][start_index:end_index+1]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "ACCGTGGACCGTTAGGATTGGACCCAAGGTTG\n",
        "ACCGTGGACCGT-AGGATTGGACC-AAGGTTA\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And count the mismatches and gaps."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mismatch_count = 0\n",
      "for i in range(start_index, end_index+1):\n",
      "    if a[0][i] != a[1][i]:\n",
      "        mismatch_count += 1\n",
      "print mismatch_count"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is a useful example for illustrating the problem that arises when ``penalize_terminal_gaps == True``.  This issue arises because if you continue to penalize adding gaps to ``s1`` after you've reached the end, the score get so low that it can be better than or equal to the following alignment. This is clearly not the result that we're looking for, so ``penalize_terminal_gaps`` is ``False`` for the global aligners. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print global_pairwise_align_nucleotide(s, t, penalize_terminal_gaps=True).to_fasta()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ">0\n",
        "-------------------------ACCGTGGACCGTTAGGATTGGACCCAAGGTT-------------------------G\n",
        ">1\n",
        "TTTTTTTTTTTTTTTTTTTTTTTTTACCGTGGACCGT-AGGATTGGACC-AAGGTTAAAAAAAAAAAAAAAAAAAAAAAAAA\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 6
    }
   ],
   "metadata": {}
  }
 ]
}