{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Frankengenome challenge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Created by Jacob Pritt with slight changes by Ben Langmead and Ankit Garg\n", "\n", "Dr. Frankenstein is trying to create a new species of bacterium by splicing together the genomes of bacteria A and B. Unfortunately, he lost all but the first two pages of his notes describing how they were spliced together. He needs your help to identify which parts of the frankengenome came from which bacterium. You are given the full frankengenome called `frankengene1.fasta` as well as the first two pages of notes, called `trainingData1.txt` and `testData1.txt`, which have correct labels for the first 50,000 and the next 50,000 bases of the frankengenome respectively. The label `0` indicates the base came from genome A and the label `1` indicates it came from genome B. E.g. the label `0000011111110011` indicates the first 5 bases came from genome A, the next 7 from genome B, the next 2 from genome A, and the last 2 from genome B.\n", "\n", "Implement a classifier that assigns a label to each base of the frankengenome according to whether it came from genome A or B. You are not expected to get the labels 100% correct; you will be graded based on your classifier's accuracy relative to a simple \"reference\" classifier. Accuracy is measured as the fraction of genome positions given the correct label.\n", "\n", "The frankengenome is available here:\n", "\n", "http://www.cs.jhu.edu/~langmea/resources/frankengene1.fasta\n", "\n", "Labels for the first 50,000 bases, the training data, are available here:\n", "\n", "http://www.cs.jhu.edu/~langmea/resources/trainingData1.txt\n", "\n", "Labels for the next 50,000 bases, the test data, are available here:\n", "\n", "http://www.cs.jhu.edu/~langmea/resources/testData1.txt\n", "\n", "You should submit:\n", "\n", "1. the source code for your solution\n", "2. instructions on how to run it, and\n", "3. a file containing a string of `0`s and `1`s, one per genome position, giving your predictions for whether each base of the Frankengenome came from genome A (`0`) or genome B (`1`).\n", "\n", "A good way to solve this problem is using an HMM, similarly to how we solved the CpG island labeling problem in class. You may borrow freely from the posted [HMM code](http://nbviewer.ipython.org/github/BenLangmead/comp-genomics-class/blob/master/notebooks/CG_HMM.ipynb). While you may choose to experiment with higher-order HMM, a 1st order HMM -- where states correspond to genomes (A or B) and emissions to individual bases -- will do fine here.\n", "\n", "When you build your classifier, build it using *only* the data from `trainingData1.txt`. When you test your classifier to assess accuracy, test it using *only* the data from `testData1.txt`. This will give you a more accurate picture about how your classifier performs on the whole genome than if you build and test using the same data.\n", "\n", "Hint: If you build your classifier using `trainingData1.txt` and use it to make predictions for `testData1.txt`, you should be able to label *at least* 98.5% of the positions correctly.\n", "\n", "Hint: The `read_genome_and_training` and `read_training_pairs` functions defined below in Python can be helpful when building the transition and emission probability tables for the HMM. Feel free to use them:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from itertools import islice, tee\n", "\n", "def pairwise(iterable):\n", " \"\"\" Create iterator over adjacent pairs of elements in given\n", " iterator. \"\"\"\n", " a, b = tee(iterable)\n", " next(b, None)\n", " return zip(a, b)\n", "\n", "\n", "def read_fasta_char_by_char(fn):\n", " \"\"\" Read FASTA file, yielding nucleotides one by one \"\"\"\n", " with open(fn) as fh:\n", " for ln in fh:\n", " if ln[0] == '>':\n", " continue\n", " for c in ln.strip():\n", " yield c\n", "\n", "def read_labels_char_by_char(fn):\n", " \"\"\" Read label file, yielding labels one by one \"\"\"\n", " with open(fn) as fh:\n", " for ln in fh:\n", " for c in ln.strip():\n", " yield c\n", "\n", "def read_genome_and_training(franken_fn, training_fn):\n", " \"\"\" Given filename for Frankengenome and filename for training data,\n", " return iterator over labeled positions of the Frankengenome,\n", " yielding pairs like ('A', 0) \"\"\"\n", " return zip(read_fasta_char_by_char(franken_fn),\n", " read_labels_char_by_char(training_fn))\n", "\n", "def read_training_pairs(training_fn):\n", " \"\"\" Return iterator over adjacent label pairs in given label file \"\"\"\n", " return pairwise(read_labels_char_by_char(training_fn))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hint: This Python function can be helpful for evaluating your classifier on the test data. Feel free to use it:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def evaluate_on_test_data(prediction, test_data_fn):\n", " \"\"\" Given an iterator over all the predictions (one per\n", " genome position), and given the name of the test data\n", " file, return a pair giving the number of correct and\n", " the total number of predictions. \"\"\"\n", " ncorrect, ntot = 0, 0\n", " for pred, act in zip(islice(prediction, 50000, None), # skip first 50K predictions\n", " read_labels_char_by_char(test_data_fn)):\n", " if pred == act:\n", " ncorrect += 1\n", " ntot += 1\n", " return ncorrect, ntot" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }