{ "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.1" }, "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## [Python Programming for Biologists, Tel-Aviv University / 0411-3122 / Spring 2015](http://py4life.github.io/TAU2015/)\n", "# Homework 7" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import seaborn as sns\n", "sns.set_style('white')\n", "sns.set_context('notebook', font_scale=1.2)\n", "sns.set_palette(sns.color_palette('Set1', 9))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1) Ebola outbeak\n", "\n", "In this question we will analyze data from the 2014 Ebola global outbreak.\n", "\n", "**a)** Start by reading the data in the CSV file from . The file contains a time series of case counts and deaths from the [World Health Organization](http://www.who.int/csr/don/en/) and [WHO situation reports](http://www.who.int/csr/disease/ebola/situation-reports/en/). The data was originally compiled by [Caitlin Rivers](https://github.com/cmrivers/ebola).\n", "\n", "Remember that _pandas_ can read files from a remote location so you don't have to copy it to your computer, unless you want to work offline." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the data contains the accumulated number of cases and deaths in several countries. The `Date` column specifies the date of the data point. The `Day` column is days since 22 March 2014. The rest of the columns contain the accumulated number of cases and deaths. Cells that have no data - there was no update for that country on that date - contain `NaN`.\n", "\n", "For example, `Cases_Guinea` is the accumulated number of cases of Ebola in Guinea (103 by 37 March 2014, 2730 by 31 December 2014, no additional cases reported until 3 January 2015); `Deaths_SierraLeone` is the accumulated number of deaths by Ebola in Sierra Leone (5 by 27 March 2014, 2827 by 31 December 2014).\n", "\n", "**b)** Next, plot the number of cases in Liberia, Guinea and Sierra Leone over the entire time of the outbreak. You can use for x-axis the number of days since the start of the outbreak, but bonus points for using the date. (Note: you can invert the axis by calling `plt.gca().invert_xaxis()`)." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because our data is full of holes and noise we want to fit a smooth function to the data. We will use the simplest smoothing function - a polynomial. \n", "\n", "> _Reminder_: a polynomial of the argument $t$ with a degree $n$ is a function of t that has this structure: $f(t) = a_0 + a_1 t + a_2 t^2 + ... + a_n t^n$. The $a_i$ are called the _coefficients of the polynomial_. A well known result in Calculus is that [any continuous function can be approximated with a finite number of polynomials](http://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem).\n", "\n", "To fit a polynomial of degree $n$ to the data one can use the method we learned in class (_curve_fit_). Another way to fit polynomials is the function [`numpy.polyfit`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html). This function's input is the x and y values and the degree of the polynomial. Its output is the coefficients ($a_0, a_1, ..., a_n$). From these coefficients we can then create a polynomial function (in the Python sense of function) that we can use like any other function. This is done by callin `np.poly1d` on the coefficients tuple.\n", "\n", "**c)** Fit a polynomial to the data in cases in Liberia. Use either `curev_fit` or `plotfit` or any other solution you find. \n", "\n", "Plot the data (using markers) and the polynomial fit (using a line). Note how the fit changes when you change the degree of the polynomial. The x-axis should be `Day`." ] }, { "cell_type": "code", "collapsed": true, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "**d)** Plot the data points and polynomial fit for cases in Liberia, Guinea, Sierra Leone and Nigeria, all on the same figure. Make sure the polynomial lines have different colors and add a legend (the data points can all be in the same color)." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2) Wright-Fisher model\n", "\n", "The Wright-Fisher model (WF) is the standard model in population genetics. It is used to model the change in allele frequencies over time due to the action of natural selection, random genetic drift, mutation and recombination.\n", "\n", "In this question we will use the WF to simulate the frequency of an allele in a single locus (one gene), bi-alleleic (two alleles) model. \n", "\n", "### Only drift (no selection, mutation or recombination)\n", "\n", "In the first simulation we will do, we will consider only the effect of [random genetic drift]() on the change in frequency of one allele (let's call it $A$). \n", "\n", "The change in the frequency from one generation to the next is described by this prcess: imagine a pot with $N$ balls. A fraction $p$ of the balls is red. We get a new pot. We draw a ball from the old pot, put a ball of the same color in the new pot, and return the drawn ball to the old pot. We do this until the new pot has $N$ balls, then we throw away the old pot, get a new pot, and start again. This game can go on until, by chance, there are no more red balls or all the balls are red.\n", "\n", "This process can be written using this equation:\n", "\n", "$$\n", "n \\sim Bin(N,p) \\\\\n", "p' = \\frac{n}{N}\n", "$$\n", "\n", "where $N$ is the population size (total number of alleles), $p$ and $p'$ are the current frequency of $A$ and its frequency in the next generation. $Bin(N,p)$ is the binomial distribution with $N$ trials and probability of success $p$ -- therefore $n$ is a random variable (random number) drawn from that binomial distribution. \n", "\n", "**a)** Write a function called `simulation(N, p0)`. \n", "The function will simulate the WF process until allele $A$ is extinct (frequency is 0) or until it fixates (frequency is 1). \n", "\n", "Input: `N` is the population size, `p0` is the initial frequency of allele $A$. \n", "\n", "Output: `p` is an `array` or a `list` of the frequencies of $A$ in every generation. \n", "\n", "The function will also plot a figure of the frequency of $A$ as a function of the number of generations." ] }, { "cell_type": "code", "collapsed": false, "input": [ "N = 1e4\n", "p0 = 0.5\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**b)** Next we want estimate the _fixation probability_ of allele $A$. We can do it by running a large number of simulations (how many?) and check at the end of each simulation if the allele is fixed or extinct. The number of fixations divided by the total number of simulations is a good estimate of the _fixation probability_. \n", "\n", "Run the simulation as many times as you think is neccessary. Calculate and print the estimated _fixation probability_." ] }, { "cell_type": "code", "collapsed": false, "input": [ "N = 1e4\n", "p0 = 0.5\n", "fixations = 0\n", "simulations = 100" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bonus: Drift & selection (no mutation or recombination)\n", "\n", "If selection also affects the process, then the chance to draw a red ball (allele $A$) is a bit higher than the other balls. This is usually modeled by saying that the chance to draw a non-red ball is $1-s$ relative to the chance to draw a red ball. The equation is then:\n", "\n", "$$\n", "\\hat{p} = \\frac{p}{1-s+sp} \\\\\n", "n \\sim Bin(N, \\hat{p}) \\\\\n", "p' = \\frac{n}{N}\n", "$$\n", "\n", "**c)** Rewrite the simulation function to include selection (add argument, change according to equation) and run the simulations loop again to estimate the fixation probability. Note that the initial probability can be rather low and still lead to many fixations - theory shows that in a large population ($N>10000$) and weak selection ($s<0.01$) the _fixation probability_ is roughly $2s$.\n", "\n", "**Note**: This is a bonus question. You can get the full grade without doing it." ] }, { "cell_type": "code", "collapsed": false, "input": [ "N = 1e4\n", "p0 = 0.5\n", "s = 0.01" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "N = 1e4\n", "p0 = 0.01\n", "s = 0.01\n", "fixations = 0\n", "simulations = 100" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 } ], "metadata": {} } ] }