{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 2.2: Exponential conjugate prior (55 pts)\n", "\n", "[Data set download](http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=nuccore&dopt=fasta&val=CP042865.1)\n", "\n", "*This problem was motivated by discussion in section 2.10 of [Holmes and Huber's book](http://web.stanford.edu/class/bios221/book/).*\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**a)** Show that the conjugate distribution for the Exponential distribution is the Gamma distribution. How are the parameters of the Gamma updated from the prior to the posterior? (*You are welcome to look at [Wikipedia's table of conjugate priors](https://en.wikipedia.org/wiki/Conjugate_prior) to check your answer, but you should not look up the actual proof that the Gamma distribution is conjugate to the Exponential.*)\n", "\n", "**b)** Download the sequence of the chromosomal DNA of *E coli* strain ATCC BAA-196 [here](http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=nuccore&dopt=fasta&val=CP042865.1) in FASTA format. This strain is resistant to multiple drugs and is used in studies of antibiotic resistance. The sequence was published in [this paper](https://doi.org/10.1128/MRA.01118-19). Read the sequence in as a single string, using the function below, if you like." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def read_fasta_single_record(filename):\n", " \"\"\"Read a sequence in from a FASTA file containing a single sequence.\n", " \n", " We assume that the first line of the file is the descriptor and all\n", " subsequent lines are sequence. \n", " \"\"\"\n", " with open(filename, 'r') as f:\n", " # Read in descriptor\n", " descriptor = f.readline().rstrip()\n", "\n", " # Read in sequence, stripping the whitespace from each line\n", " seq = ''\n", " line = f.readline().rstrip()\n", " while line != '':\n", " seq += line\n", " line = f.readline().rstrip()\n", " \n", " return descriptor, seq\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**c)** Find the index in the sequence of all [Shine-Delgarno motifs](https://en.wikipedia.org/wiki/Shine-Dalgarno_sequence), which is an important motif in initiation of protein synthesis. The Shine-Delgarno sequence is `AGGAGGT`. You can use the function below to find recognition sequences." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "\n", "def recognition_sites_with_re(seq, recog_seq):\n", " \"\"\"Find the indices of all recognition sites in a sequence.\"\"\"\n", " sites = []\n", " for site in re.finditer(recog_seq, seq):\n", " sites.append(site.start())\n", " \n", " return sites" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store the number of bases between each occurrence of the motif as an array.\n", "\n", "**d)** We will model the distance between each motif as Exponentially distributed. Explain why this may be a reasonable model.\n", "\n", "**e)** Make an exploratory plot of the ECDF of inter-motif distances. Does it look Exponential?\n", "\n", "**f)** Use an Exponential likelihood and a Gamma prior distribution to compute and plot the posterior distribution of the rate parameter (the inverse of the characteristic distance between motifs) of the Exponential distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }