{
 "metadata": {
  "name": "",
  "signature": "sha256:2ddb0dadd4bf9e57af676ce1026f65f6f4c1ce41f6aafef84d78a0ce9697f4e7"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## [Python Programming for Biologists, Tel-Aviv University / 0411-3122 / Spring 2015](http://py4life.github.io/TAU2015/)\n",
      "# Homework 5"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Analyzing the human mitochondrial genome\n",
      "In this exercise, we will process and analyze the human mitochondrial genome using Biopython.  \n",
      "All the information that will be used is stored in GenBank record NC_012920. It is recommended that you [inspect it](http://www.ncbi.nlm.nih.gov/nuccore/251831106) before starting to work."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### 1) Fetching the GenBank record\n",
      "Write a function that receives a GenBank id (such as NC_012920) and returns a Biopython SeqRecord object of the corresponding result. Use it to fetch the human mitochondrial genome record. Assume the default settings, as shown in class. Ignore any warning messages from NCBI that might be displayed."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from Bio import Entrez\n",
      "from Bio import SeqIO\n",
      "\n",
      "def fetch_gb_by_id(rec_id):\n",
      "    handle = Entrez.efetch(db=\"nucleotide\", rettype=\"gb\", retmode=\"text\", id=rec_id)\n",
      "    gb_record = SeqIO.read(handle, \"gb\")\n",
      "    handle.close()\n",
      "    return gb_record\n",
      "\n",
      "mito = fetch_gb_by_id('NC_012920')\n",
      "assert mito.description == 'Homo sapiens mitochondrion, complete genome.'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "C:\\pyzo2015a\\lib\\site-packages\\Bio\\Entrez\\__init__.py:455: UserWarning: \n",
        "Email address is not specified.\n",
        "\n",
        "To make use of NCBI's E-utilities, NCBI requires you to specify your\n",
        "email address with each request.  As an example, if your email address\n",
        "is A.N.Other@example.com, you can specify it as follows:\n",
        "   from Bio import Entrez\n",
        "   Entrez.email = 'A.N.Other@example.com'\n",
        "In case of excessive usage of the E-utilities, NCBI will attempt to contact\n",
        "a user at the email address provided before blocking access to the\n",
        "E-utilities.\n",
        "  E-utilities.\"\"\", UserWarning)\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### 2) Extracting the genes\n",
      "The record fetched in section 1 is the complete sequence of the mitochondrial genome. We will now extract the genes from this sequence.  \n",
      "A sequence record includes many _features_, such as CDS, STS, genes and other. To access the features of the `mito` SeqRecord, simply use `mito.features`.  \n",
      "Let's explore the features field of the record and see some examples by running the following code:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mito_features = mito.features\n",
      "print(type(mito_features))\n",
      "print(type(mito_features[0]))\n",
      "print('************')\n",
      "print(mito_features[0])\n",
      "print('************')\n",
      "print(mito_features[1])\n",
      "print('************')\n",
      "print(mito_features[2])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<class 'list'>\n",
        "<class 'Bio.SeqFeature.SeqFeature'>\n",
        "************\n",
        "type: source\n",
        "location: [0:16569](+)\n",
        "qualifiers:\n",
        "    Key: country, Value: ['United Kingdom: Great Britain']\n",
        "    Key: db_xref, Value: ['taxon:9606']\n",
        "    Key: isolation_source, Value: ['caucasian']\n",
        "    Key: mol_type, Value: ['genomic DNA']\n",
        "    Key: note, Value: ['this is the rCRS']\n",
        "    Key: organelle, Value: ['mitochondrion']\n",
        "    Key: organism, Value: ['Homo sapiens']\n",
        "    Key: tissue_type, Value: ['placenta']\n",
        "\n",
        "************\n",
        "type: D-loop\n",
        "location: join{[0:576](-), [16023:16569](-)}\n",
        "qualifiers:\n",
        "Sub-Features\n",
        "type: D-loop\n",
        "location: [16023:16569](-)\n",
        "qualifiers:\n",
        "\n",
        "type: D-loop\n",
        "location: [0:576](-)\n",
        "qualifiers:\n",
        "\n",
        "\n",
        "************\n",
        "type: gene\n",
        "location: [576:647](+)\n",
        "qualifiers:\n",
        "    Key: db_xref, Value: ['GeneID:4558', 'HGNC:HGNC:7481', 'MIM:590070']\n",
        "    Key: gene, Value: ['TRNF']\n",
        "    Key: nomenclature, Value: ['Official Symbol: MT-TF | Name: mitochondrially encoded tRNA phenylalanine | Provided by: HGNC:HGNC:7481']\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can learn more about SeqRecord features by visiting the dedicated [documentation page](http://biopython.org/DIST/docs/_api_161/Bio.SeqFeature.SeqFeature-class.html).  \n",
      "We can see that there are different types of features, and that each one contains several fields of information. We will only need the `type`, the `location` and the `qualifiers` fields, which We can access like this:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "gene = mito.features[2]\n",
      "print(gene.type)\n",
      "print(gene.location)\n",
      "print(gene.qualifiers)\n",
      "print(type(gene.qualifiers))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "gene\n",
        "[576:647](+)\n",
        "{'gene': ['TRNF'], 'nomenclature': ['Official Symbol: MT-TF | Name: mitochondrially encoded tRNA phenylalanine | Provided by: HGNC:HGNC:7481'], 'db_xref': ['GeneID:4558', 'HGNC:HGNC:7481', 'MIM:590070']}\n",
        "<class 'dict'>\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "__a)__ Write a function that receives a SeqRecord and extract its __gene__ features. The function should return a list of dictionaries, where each dictionary represents one gene. Gene dictionaries will have three keys: `start` (start position), `end` (end position) and `name` (gene name). Complete the code below and use it on the `mito` SeqRecord."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def extract_genes(gb_record):\n",
      "    genes = []\n",
      "    for feat in gb_record.features:\n",
      "        # if feature is gene\n",
      "        if feat.type == 'gene':\n",
      "            gene_dict = {}\n",
      "            try:\n",
      "                # get start and end positions\n",
      "                location = feat.location\n",
      "                start = location.start\n",
      "                end = location.end\n",
      "                gene_name = feat.qualifiers['gene'][0]\n",
      "                gene_dict['start'] = start\n",
      "                gene_dict['end'] = end\n",
      "                gene_dict['name'] = gene_name\n",
      "            except:\n",
      "                print('Failed to obtain stats for feature')\n",
      "                print(feat.qualifiers.keys())\n",
      "            if gene_dict != {}:\n",
      "                genes.append(gene_dict)\n",
      "    return genes\n",
      "\n",
      "mito_genes = extract_genes(mito)\n",
      "assert len(mito_genes) == 37\n",
      "assert type(mito_genes[0]) == dict"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "__b)__ We will now use the result of section _2a_ to extract the genes from the complete sequence.  \n",
      "Write a function that receives a SeqRecord and a list of genes (as created in section _2a_) and returns a list of Seqrecords, each corresponding to a single gene. The description of the output SeqRecords should be the gene name, and the id should be an index running from 1 to the number of genes in the list. Run the function on the mitochondrial data. Complete the code."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def genes_records(gb_record,gene_list):\n",
      "    gene_records = []\n",
      "    for gene in gene_list:\n",
      "        start = gene['start']\n",
      "        end = gene['end']\n",
      "        name = gene['name']\n",
      "        rec = gb_record[start:end]\n",
      "        rec.description = name\n",
      "        rec.id = str(len(gene_records)+1)\n",
      "        gene_records.append(rec)\n",
      "    return gene_records\n",
      "\n",
      "mito_gene_records = genes_records(mito,mito_genes)\n",
      "assert len(mito_gene_records) == 37"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### 3) Printing the result\n",
      "Print the gene records to an output file of your choice, __in fasta format__."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "out_file = \"mito_genes.fasta\"\n",
      "# print to output file\n",
      "SeqIO.write(mito_gene_records,out_file,'fasta')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "37"
       ]
      }
     ],
     "prompt_number": 6
    }
   ],
   "metadata": {}
  }
 ]
}