{ "metadata": { "name": "", "signature": "sha256:fe481833b6efa462eca49932a3638d99cd501eea60664394cdd73a9dd0135248" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## [Python Programming for Biologists, Tel-Aviv University / 0411-3122 / Spring 2015](http://py4life.github.io/TAU2015/)\n", "# Homework 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyzing the human mitochondrial genome\n", "In this exercise, we will process and analyze the human mitochondrial genome using Biopython. \n", "All the information that will be used is stored in GenBank record NC_012920. It is recommended that you [inspect it](http://www.ncbi.nlm.nih.gov/nuccore/251831106) before starting to work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1) Fetching the GenBank record\n", "Write a function that receives a GenBank id (such as NC_012920) and returns a Biopython SeqRecord object of the corresponding result. Use it to fetch the human mitochondrial genome record. Assume the default settings, as shown in class. Ignore any warning messages from NCBI that might be displayed." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from Bio import Entrez\n", "from Bio import SeqIO\n", "\n", "def fetch_gb_by_id(rec_id):\n", "\n", " \n", " \n", " return gb_record\n", "\n", "mito = fetch_gb_by_id('NC_012920')\n", "assert mito.description == 'Homo sapiens mitochondrion, complete genome.'" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2) Extracting the genes\n", "The record fetched in section 1 is the complete sequence of the mitochondrial genome. We will now extract the genes from this sequence. \n", "A sequence record includes many _features_, such as CDS, STS, genes and other. To access the features of the `mito` SeqRecord, simply use `mito.features`. \n", "Let's explore the features field of the record and see some examples by running the following code:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "mito_features = mito.features\n", "print(type(mito_features))\n", "print(type(mito_features[0]))\n", "print('************')\n", "print(mito_features[0])\n", "print('************')\n", "print(mito_features[1])\n", "print('************')\n", "print(mito_features[2])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can learn more about SeqRecord features by visiting the dedicated [documentation page](http://biopython.org/DIST/docs/_api_161/Bio.SeqFeature.SeqFeature-class.html). \n", "We can see that there are different types of features, and that each one contains several fields of information. We will only need the `type`, the `location` and the `qualifiers` fields, which We can access like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "gene = mito.features[2]\n", "print(gene.type)\n", "print(gene.location)\n", "print(gene.qualifiers)\n", "print(type(gene.qualifiers))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__a)__ Write a function that receives a SeqRecord and extract its __gene__ features. The function should return a list of dictionaries, where each dictionary represents one gene. Gene dictionaries will have three keys: `start` (start position), `end` (end position) and `name` (gene name). Complete the code below and use it on the `mito` SeqRecord." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def extract_genes(gb_record):\n", " genes = []\n", " for feat in gb_record.features:\n", " # if feature is gene\n", " if ______________________:\n", " gene_dict = {}\n", " try:\n", " # get start and end positions\n", " location = ________________\n", " start = __________________\n", " end = ___________________\n", " gene_name = _________________\n", " \n", " # insert into dictionary\n", " \n", " \n", " \n", "\n", " except:\n", " print('Failed to obtain stats for feature')\n", " if gene_dict != {}:\n", " genes.append(gene_dict)\n", " return genes\n", "\n", "mito_genes = extract_genes(mito)\n", "assert len(mito_genes) == 37\n", "assert type(mito_genes[0]) == dict" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__b)__ We will now use the result of section _2a_ to extract the genes from the complete sequence. \n", "Write a function that receives a SeqRecord and a list of genes (as created in section _2a_) and returns a list of Seqrecords, each corresponding to a single gene. The description of the output SeqRecords should be the gene name, and the id should be an index running from 1 to the number of genes in the list. Run the function on the mitochondrial data. Complete the code." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def genes_records(gb_record,gene_list):\n", " gene_records = []\n", " for gene in gene_list:\n", " start = ______________\n", " end = ______________\n", " name = ______________\n", " rec = gb_record[start:end]\n", " rec.description = ______________\n", " rec.id = ______________\n", " gene_records.append(rec)\n", " return ______________\n", "\n", "mito_gene_records = genes_records(mito,mito_genes)\n", "assert len(mito_gene_records) == 37" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3) Printing the result\n", "Print the gene records to an output file of your choice, __in fasta format__." ] }, { "cell_type": "code", "collapsed": false, "input": [ "out_file = \"\"\n", "# print to output file\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }