{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook 5: \n", "\n", "This is an IPython notebook. Most of the code is composed of bash scripts, indicated by %%bash at the top of the cell, otherwise it is IPython code. This notebook includes code to download, assemble and analyze a published RADseq data set." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [] } ], "source": [ "### Notebook 5\n", "### Data set 5 (Heliconius)\n", "### Authors: Nadeau et al. (2013)\n", "### Data Location: ERP000991" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the sequence data\n", "Below I read in EraRunTable.txt for this project which contains all of the information we need to download the data. \n", "\n", "+ Project ERA: ERP000991" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "## make a new directory for this analysis\n", "mkdir -p empirical_5/fastq/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### sub-sampling\n", "The authors used only a subset of the data that are on the archive for their phylogenetic analyses so I will choose the same 54 samples here which are listed in Table S1 of their publication. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "54" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subsamples = ['ERS070268','ERS070269','ERS070270','ERS070271','ERS070272',\n", " 'ERS070273','ERS070274','ERS070275','ERS070276','ERS070277',\n", " 'ERS070246','ERS070247','ERS070248','ERS070249','ERS070245',\n", " 'ERS070252','ERS070253','ERS070254','ERS070256','ERS070255',\n", " 'ERS074398','ERS074399','ERS074400','ERS074401','ERS074402',\n", " 'ERS074403','ERS074404','ERS074405','ERS074406','ERS074407',\n", " 'ERS074408','ERS074409','ERS074410','ERS074411','ERS074412',\n", " 'ERS074413','ERS074414','ERS074415','ERS074416','ERS074417',\n", " 'ERS074418','ERS074419','ERS074420','ERS074421','ERS074422',\n", " 'ERS074423','ERS074424','ERS074425','ERS074426','ERS074427',\n", " 'ERS074428','ERS074429','ERS070250','ERS070251']\n", "\n", "len(subsamples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### For each ERS (individuals) get all of the ERR (sequence file accessions)." ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " study_accession secondary_study_accession sample_accession \\\n", "0 PRJEB2743 ERP000991 SAMEA1322899 \n", "1 PRJEB2743 ERP000991 SAMEA1322899 \n", "2 PRJEB2743 ERP000991 SAMEA1322873 \n", "3 PRJEB2743 ERP000991 SAMEA1322873 \n", "4 PRJEB2743 ERP000991 SAMEA1322940 \n", "\n", " secondary_sample_accession experiment_accession run_accession tax_id \\\n", "0 ERS070236 ERX030872 ERR053824 33444 \n", "1 ERS070236 ERX030873 ERR053825 33444 \n", "2 ERS070237 ERX030874 ERR053826 33444 \n", "3 ERS070237 ERX030875 ERR053827 33444 \n", "4 ERS070238 ERX030876 ERR053828 33444 \n", "\n", " scientific_name instrument_model library_layout \\\n", "0 Heliconius elevatus Illumina Genome Analyzer IIx PAIRED \n", "1 Heliconius elevatus Illumina Genome Analyzer IIx PAIRED \n", "2 Heliconius elevatus Illumina Genome Analyzer IIx PAIRED \n", "3 Heliconius elevatus Illumina Genome Analyzer IIx PAIRED \n", "4 Heliconius elevatus Illumina Genome Analyzer IIx PAIRED \n", "\n", " fastq_ftp \\\n", "0 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053824/... \n", "1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053825/... \n", "2 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053826/... \n", "3 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053827/... \n", "4 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053828/... \n", "\n", " fastq_galaxy \\\n", "0 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053824/... \n", "1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053825/... \n", "2 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053826/... \n", "3 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053827/... \n", "4 ftp.sra.ebi.ac.uk/vol1/fastq/ERR053/ERR053828/... \n", "\n", " submitted_ftp \\\n", "0 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... \n", "1 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... \n", "2 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... \n", "3 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... \n", "4 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... \n", "\n", " submitted_galaxy cram_index_ftp \\\n", "0 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... NaN \n", "1 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... NaN \n", "2 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... NaN \n", "3 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... NaN \n", "4 ftp.sra.ebi.ac.uk/vol1/ERA066/ERA066292/fastq/... NaN \n", "\n", " cram_index_galaxy \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n" ] } ], "source": [ "## IPython code\n", "import pandas as pd\n", "import numpy as np\n", "import urllib2\n", "import os\n", "\n", "## open the SRA run table from github url\n", "url = \"https://raw.githubusercontent.com/\"+\\\n", " \"dereneaton/RADmissing/master/empirical_9_EraRunTable.txt\"\n", "intable = urllib2.urlopen(url)\n", "indata = pd.read_table(intable, sep=\"\\t\")\n", "\n", "## print first few rows\n", "print indata.head()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def wget_download_ERR(ERR, outdir, outname):\n", " \"\"\" Python function to get sra data from ncbi and write to\n", " outdir with a new name using bash call wget \"\"\"\n", " \n", " ## get output name\n", " output = os.path.join(outdir, outname+\".fastq.gz\")\n", "\n", " ## create a call string \n", " call = \"wget -q -r -nH --cut-dirs=9 -O \"+output+\" \"+\\\n", " \"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/\"+\\\n", " \"{}/{}/{}_1.fastq.gz\".format(ERR[:6], ERR, ERR)\n", " \n", " ## call bash script\n", " ! $call " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we pass the SRR number and the sample name to the `wget_download` function so that the files are saved with their sample names. " ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Heliconius timareta ssp. JD-2011 \tERS070246\tERR053848\n", "Heliconius timareta ssp. JD-2011 \tERS070246\tERR053849\n", "Heliconius timareta ssp. JD-2011 \tERS070246\tERR053850\n", "Heliconius timareta ssp. JD-2011 \tERS070247\tERR053851\n", "Heliconius timareta ssp. JD-2011 \tERS070247\tERR053852\n", "Heliconius timareta ssp. JD-2011 \tERS070247\tERR053853\n", "Heliconius timareta ssp. JD-2011 \tERS070245\tERR053854\n", "Heliconius timareta ssp. JD-2011 \tERS070245\tERR053855\n", "Heliconius timareta ssp. JD-2011 \tERS070248\tERR053856\n", "Heliconius timareta ssp. JD-2011 \tERS070248\tERR053857\n", "Heliconius timareta ssp. JD-2011 \tERS070249\tERR053858\n", "Heliconius timareta ssp. JD-2011 \tERS070249\tERR053859\n", "Heliconius timareta timareta \tERS070250\tERR053860\n", "Heliconius timareta timareta \tERS070250\tERR053861\n", "Heliconius timareta timareta \tERS070250\tERR053862\n", "Heliconius timareta timareta \tERS070251\tERR053863\n", "Heliconius timareta timareta \tERS070251\tERR053864\n", "Heliconius timareta timareta \tERS070251\tERR053865\n", "Heliconius hecale felix \tERS070252\tERR053866\n", "Heliconius hecale felix \tERS070252\tERR053867\n", "Heliconius hecale felix \tERS070253\tERR053868\n", "Heliconius hecale felix \tERS070253\tERR053869\n", "Heliconius hecale felix \tERS070254\tERR053870\n", "Heliconius hecale felix \tERS070254\tERR053871\n", "Heliconius hecale felix \tERS070255\tERR053872\n", "Heliconius hecale felix \tERS070255\tERR053873\n", "Heliconius hecale felix \tERS070255\tERR053874\n", "Heliconius hecale felix \tERS070256\tERR053875\n", "Heliconius hecale felix \tERS070256\tERR053876\n", "Heliconius hecale felix \tERS070256\tERR053877\n", "Heliconius melpomene aglaope \tERS070268\tERR053897\n", "Heliconius melpomene aglaope \tERS070268\tERR053898\n", "Heliconius melpomene aglaope \tERS070269\tERR053899\n", "Heliconius melpomene aglaope \tERS070269\tERR053900\n", "Heliconius melpomene aglaope \tERS070270\tERR053901\n", "Heliconius melpomene aglaope \tERS070270\tERR053902\n", "Heliconius melpomene aglaope \tERS070271\tERR053903\n", "Heliconius melpomene aglaope \tERS070271\tERR053904\n", "Heliconius melpomene aglaope \tERS070272\tERR053905\n", "Heliconius melpomene aglaope \tERS070272\tERR053906\n", "Heliconius melpomene amaryllis \tERS070273\tERR053907\n", "Heliconius melpomene amaryllis \tERS070273\tERR053908\n", "Heliconius melpomene amaryllis \tERS070274\tERR053909\n", "Heliconius melpomene amaryllis \tERS070274\tERR053910\n", "Heliconius melpomene amaryllis \tERS070275\tERR053911\n", "Heliconius melpomene amaryllis \tERS070275\tERR053912\n", "Heliconius melpomene amaryllis \tERS070276\tERR053913\n", "Heliconius melpomene amaryllis \tERS070276\tERR053914\n", "Heliconius melpomene amaryllis \tERS070277\tERR053915\n", "Heliconius melpomene amaryllis \tERS070277\tERR053916\n", "Heliconius heurippa \tERS074398\tERR055402\n", "Heliconius heurippa \tERS074399\tERR055403\n", "Heliconius heurippa \tERS074400\tERR055404\n", "Heliconius heurippa \tERS074401\tERR055405\n", "Heliconius heurippa \tERS074402\tERR055406\n", "Heliconius heurippa \tERS074403\tERR055407\n", "Heliconius cydno cordula \tERS074404\tERR055408\n", "Heliconius cydno cordula \tERS074405\tERR055409\n", "Heliconius cydno cordula \tERS074406\tERR055410\n", "Heliconius cydno cordula \tERS074407\tERR055411\n", "Heliconius cydno cordula \tERS074408\tERR055412\n", "Heliconius cydno cordula \tERS074409\tERR055413\n", "Heliconius cydno chioneus \tERS074410\tERR055414\n", "Heliconius cydno chioneus \tERS074411\tERR055415\n", "Heliconius cydno weymeri \tERS074412\tERR055416\n", "Heliconius cydno weymeri \tERS074413\tERR055417\n", "Heliconius cydno cydnides \tERS074414\tERR055418\n", "Heliconius cydno cydnides \tERS074415\tERR055419\n", "Heliconius melpomene melpomene \tERS074416\tERR055420\n", "Heliconius melpomene melpomene \tERS074417\tERR055421\n", "Heliconius melpomene melpomene \tERS074418\tERR055422\n", "Heliconius melpomene melpomene \tERS074419\tERR055423\n", "Heliconius melpomene melpomene \tERS074420\tERR055424\n", "Heliconius melpomene melpomene \tERS074421\tERR055425\n", "Heliconius melpomene melpomene \tERS074422\tERR055426\n", "Heliconius melpomene melpomene \tERS074423\tERR055427\n", "Heliconius melpomene melpomene \tERS074424\tERR055428\n", "Heliconius melpomene melpomene \tERS074425\tERR055429\n", "Heliconius melpomene rosina \tERS074426\tERR055430\n", "Heliconius melpomene rosina \tERS074427\tERR055431\n", "Heliconius melpomene ecuadorensis \tERS074428\tERR055432\n", "Heliconius melpomene ecuadorensis \tERS074429\tERR055433\n" ] } ], "source": [ "for ID, ERS, ERR in zip(indata.scientific_name, \n", " indata.secondary_sample_accession,\n", " indata.run_accession):\n", " if ERS in subsamples:\n", " name = ID.split()[1]\n", " name += \"_\"+ERS+\"_\"+ERR\n", " print \"{:<35}\\t{}\\t{}\".format(ID, ERS, ERR)\n", " wget_download_ERR(ERR, \"empirical_5/fastq/\", name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Note:\n", "The data look kind of weird because there are a lot of As in the beginning. I figured out it is just because the sequences are sorted alphabetically." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Merge technical replicates\n", "This study includes several technical replicates per sequenced individuals, which we combine into a single file for each individual here. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Make a new directory for merged files" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "mkdir -p empirical_5/fastq/merged" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "melpomene_ERS070276 merged\n", "melpomene_ERS070277 merged\n", "melpomene_ERS070274 merged\n", "melpomene_ERS070275 merged\n", "melpomene_ERS070272 merged\n", "melpomene_ERS070273 merged\n", "melpomene_ERS070270 merged\n", "melpomene_ERS070271 merged\n", "timareta_ERS070247 merged\n", "timareta_ERS070246 merged\n", "timareta_ERS070245 merged\n", "timareta_ERS070249 merged\n", "timareta_ERS070248 merged\n", "cydno_ERS074413 merged\n", "cydno_ERS074412 merged\n", "cydno_ERS074411 merged\n", "cydno_ERS074410 merged\n", "cydno_ERS074415 merged\n", "cydno_ERS074414 merged\n", "hecale_ERS070253 merged\n", "hecale_ERS070252 merged\n", "hecale_ERS070256 merged\n", "hecale_ERS070255 merged\n", "hecale_ERS070254 merged\n", "melpomene_ERS074418 merged\n", "melpomene_ERS074419 merged\n", "melpomene_ERS074416 merged\n", "melpomene_ERS074417 merged\n", "melpomene_ERS070269 merged\n", "melpomene_ERS070268 merged\n", "timareta_ERS070250 merged\n", "timareta_ERS070251 merged\n" ] } ], "source": [ "##IPython code\n", "import glob\n", "import os\n", "\n", "## get all fastq files\n", "taxa = [i.split(\"/\")[-1].rsplit('_',1)[0] for i in \\\n", " glob.glob(\"empirical_5/fastq/*.gz\")]\n", "\n", "## iterate over individuals and create merge file\n", "for taxon in set(taxa):\n", " print taxon, \"merged\"\n", " reps = glob.glob(\"empirical_5/fastq/\"+taxon+\"*\")\n", "\n", " if len(reps) > 1: \n", " ## get replicate files\n", " with open(\"empirical_5/fastq/merged/\"+taxon+\".fastq.gz\", 'wb') as out:\n", " for rep in reps:\n", " with open(rep, 'rb') as tempin:\n", " out.write(tempin.read())\n", " else:\n", " dirs, ff = os.path.split(reps[0])\n", " os.rename(os.path.join(dirs,ff), os.path.join(dirs,\"merged\",taxon+\".fastq.gz\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make a params file" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pyRAD 3.0.63\n" ] } ], "source": [ "%%bash\n", "pyrad --version" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\tnew params.txt file created\n" ] } ], "source": [ "%%bash\n", "## remove old params file if it exists\n", "rm params.txt \n", "\n", "## create a new default params file\n", "pyrad -n " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Note: \n", "The data here are from Illumina Casava <1.8, so the phred scores are offset by 64 instead of 33, so we use that in the params file below." ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash\n", "## substitute new parameters into file\n", "sed -i '/## 1. /c\\empirical_5/ ## 1. working directory ' params.txt\n", "sed -i '/## 6. /c\\TGCAG ## 6. cutters ' params.txt\n", "sed -i '/## 7. /c\\20 ## 7. N processors ' params.txt\n", "sed -i '/## 9. /c\\6 ## 9. NQual ' params.txt\n", "sed -i '/## 10./c\\.85 ## 10. clust threshold ' params.txt\n", "sed -i '/## 12./c\\4 ## 12. MinCov ' params.txt\n", "sed -i '/## 13./c\\10 ## 13. maxSH ' params.txt\n", "sed -i '/## 14./c\\empirical_5_m4 ## 14. output name ' params.txt\n", "sed -i '/## 18./c\\empirical_5/fastq/merged/*.gz ## 18. data location ' params.txt\n", "sed -i '/## 20./c\\64 ## 20. phred offset ' params.txt\n", "sed -i '/## 29./c\\2,2 ## 29. trim overhang ' params.txt\n", "sed -i '/## 30./c\\p,n,s ## 30. output formats ' params.txt" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==** parameter inputs for pyRAD version 3.0.63 **======================== affected step ==\r\n", "empirical_5/ ## 1. working directory \r\n", "./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)\r\n", "./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1)\r\n", "vsearch ## 4. command (or path) to call vsearch (or usearch) (s3,s6)\r\n", "muscle ## 5. command (or path) to call muscle (s3,s7)\r\n", "TGCAG ## 6. cutters \r\n", "20 ## 7. N processors \r\n", "6 ## 8. Mindepth: min coverage for a cluster (s4,s5)\r\n", "6 ## 9. NQual \r\n", ".85 ## 10. clust threshold \r\n", "rad ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)\r\n", "4 ## 12. MinCov \r\n", "10 ## 13. maxSH \r\n", "empirical_5 ## 14. output name \r\n", "==== optional params below this line =================================== affected step ==\r\n", " ## 15.opt.: select subset (prefix* only selector) (s2-s7)\r\n", " ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7)\r\n", " ## 17.opt.: exclude taxa (list or prefix*) (s7)\r\n", "empirical_5/fastq/merged/*.gz ## 18. data location \r\n", " ## 19.opt.: maxM: N mismatches in barcodes (def= 1) (s1)\r\n", "64 ## 20. phred offset \r\n", " ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict (s2)\r\n", " ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)\r\n", " ## 23.opt.: maxN: max Ns in a cons seq (def=5) (s5)\r\n", " ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5) (s5)\r\n", " ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)\r\n", " ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100) (s7)\r\n", " ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)\r\n", " ## 28.opt.: random number seed (def. 112233) (s3,s6,s7)\r\n", "2,2 ## 29. trim overhang \r\n", "p,n,s ## 30. output formats \r\n", " ## 31.opt.: maj. base call at depth>x> log.txt 2>&1 " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "sed -i '/## 12./c\\2 ## 12. MinCov ' params.txt\n", "sed -i '/## 14./c\\empirical_5_m2 ## 14. output name ' params.txt" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pyrad -p params.txt -s 7 >> log.txt 2>&1 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results\n", "We are interested in the relationship between the amount of input (raw) data between any two samples, the average coverage they recover when clustered together, and the phylogenetic distances separating samples. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Raw data amounts\n", "The average number of raw reads per sample is 1.36M." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count 54.000000\n", "mean 5315416.333333\n", "std 2877440.775877\n", "min 2022986.000000\n", "25% 3310234.000000\n", "50% 3784971.000000\n", "75% 8453907.500000\n", "max 12011939.000000\n", "Name: passed.total, dtype: float64\n", "\n", "most raw data in sample:\n", "15 hecale_ERS070255\n", "Name: sample , dtype: object\n" ] } ], "source": [ "## read in the data\n", "s2dat = pd.read_table(\"empirical_5/stats/s2.rawedit.txt\", header=0, nrows=55)\n", "\n", "## print summary stats\n", "print s2dat[\"passed.total\"].describe()\n", "\n", "## find which sample has the most raw data\n", "maxraw = s2dat[\"passed.total\"].max()\n", "print \"\\nmost raw data in sample:\"\n", "print s2dat['sample '][s2dat['passed.total']==maxraw]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Look at distributions of coverage\n", "pyrad v.3.0.63 outputs depth information for each sample which I read in here and plot. First let's ask which sample has the highest depth of coverage. The std of coverages is pretty low in this data set compared to several others. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "summary of means\n", "==================\n", "count 54.000000\n", "mean 47.842574\n", "std 19.705225\n", "min 15.745000\n", "25% 38.288250\n", "50% 44.435500\n", "75% 56.646000\n", "max 92.441000\n", "Name: dpt.me, dtype: float64\n", "\n", "summary of std\n", "==================\n", "count 54.000000\n", "mean 165.924389\n", "std 76.682339\n", "min 34.403000\n", "25% 118.026500\n", "50% 150.211500\n", "75% 215.882000\n", "max 381.038000\n", "Name: dpt.sd, dtype: float64\n", "\n", "summary of proportion lowdepth\n", "==================\n", "count 54.000000\n", "mean 0.359732\n", "std 0.135457\n", "min 0.151041\n", "25% 0.257660\n", "50% 0.347190\n", "75% 0.466101\n", "max 0.669051\n", "dtype: float64\n", "\n", "highest coverage in sample:\n", "53 timareta_ERS070251\n", "Name: taxa, dtype: object\n" ] } ], "source": [ "## read in the s3 results\n", "s5dat = pd.read_table(\"empirical_5/stats/s3.clusters.txt\", header=0, nrows=54)\n", "\n", "## print summary stats\n", "print \"summary of means\\n==================\"\n", "print s5dat['dpt.me'].describe()\n", "\n", "## print summary stats\n", "print \"\\nsummary of std\\n==================\"\n", "print s5dat['dpt.sd'].describe()\n", "\n", "## print summary stats\n", "print \"\\nsummary of proportion lowdepth\\n==================\"\n", "print pd.Series(1-s5dat['d>5.tot']/s5dat[\"total\"]).describe()\n", "\n", "## find which sample has the greatest depth of retained loci\n", "max_hiprop = (s5dat[\"d>5.tot\"]/s5dat[\"total\"]).max()\n", "print \"\\nhighest coverage in sample:\"\n", "print s5dat['taxa'][s5dat['d>5.tot']/s5dat[\"total\"]==max_hiprop]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "44.0298150232 117.15674992\n" ] } ], "source": [ "## print mean and std of coverage for the highest coverage sample\n", "with open(\"empirical_5/clust.85/timareta_ERS070251.depths\", 'rb') as indat:\n", " depths = np.array(indat.read().strip().split(\",\"), dtype=int)\n", " \n", "print depths.mean(), depths.std()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plot the coverage for the sample with highest mean coverage\n", "Green shows the loci that were discarded and orange the loci that were retained. The majority of data were discarded for being too low of coverage. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "
0102030Depth of coverage (N reads)010002000300040005000N locidataset5/sample=timareta_ERS070251
  • Save as .csv
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import toyplot\n", "import toyplot.svg\n", "import numpy as np\n", "\n", "## read in the depth information for this sample\n", "with open(\"empirical_5/clust.85/timareta_ERS070251.depths\", 'rb') as indat:\n", " depths = np.array(indat.read().strip().split(\",\"), dtype=int)\n", " \n", "## make a barplot in Toyplot\n", "canvas = toyplot.Canvas(width=350, height=300)\n", "axes = canvas.axes(xlabel=\"Depth of coverage (N reads)\", \n", " ylabel=\"N loci\", \n", " label=\"dataset5/sample=timareta_ERS070251\")\n", "\n", "## select the loci with depth > 5 (kept)\n", "keeps = depths[depths>5]\n", "\n", "## plot kept and discarded loci\n", "edat = np.histogram(depths, range(30)) # density=True)\n", "kdat = np.histogram(keeps, range(30)) #, density=True)\n", "axes.bars(edat)\n", "axes.bars(kdat)\n", "\n", "#toyplot.svg.render(canvas, \"empirical_5_depthplot.svg\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Print final stats table" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "125876 ## loci with > minsp containing data\r\n", "119819 ## loci with > minsp containing data & paralogs removed\r\n", "119819 ## loci with > minsp containing data & paralogs removed & final filtering\r\n", "\r\n", "## number of loci recovered in final data set for each taxon.\r\n", "taxon\tnloci\r\n", "cydno_ERS074404 \t37350\r\n", "cydno_ERS074405 \t38572\r\n", "cydno_ERS074406 \t37734\r\n", "cydno_ERS074407 \t37863\r\n", "cydno_ERS074408 \t37843\r\n", "cydno_ERS074409 \t38464\r\n", "cydno_ERS074410 \t33024\r\n", "cydno_ERS074411 \t30593\r\n", "cydno_ERS074412 \t22874\r\n", "cydno_ERS074413 \t22699\r\n", "cydno_ERS074414 \t22996\r\n", "cydno_ERS074415 \t22966\r\n", "hecale_ERS070252 \t38275\r\n", "hecale_ERS070253 \t38129\r\n", "hecale_ERS070254 \t38378\r\n", "hecale_ERS070255 \t38013\r\n", "hecale_ERS070256 \t37917\r\n", "heurippa_ERS074398 \t38458\r\n", "heurippa_ERS074399 \t38354\r\n", "heurippa_ERS074400 \t37843\r\n", "heurippa_ERS074401 \t38376\r\n", "heurippa_ERS074402 \t38231\r\n", "heurippa_ERS074403 \t38201\r\n", "melpomene_ERS070268\t40055\r\n", "melpomene_ERS070269\t40284\r\n", "melpomene_ERS070270\t39727\r\n", "melpomene_ERS070271\t39908\r\n", "melpomene_ERS070272\t38809\r\n", "melpomene_ERS070273\t40151\r\n", "melpomene_ERS070274\t36786\r\n", "melpomene_ERS070275\t40068\r\n", "melpomene_ERS070276\t39807\r\n", "melpomene_ERS070277\t37965\r\n", "melpomene_ERS074416\t33006\r\n", "melpomene_ERS074417\t36365\r\n", "melpomene_ERS074418\t36483\r\n", "melpomene_ERS074419\t35993\r\n", "melpomene_ERS074420\t38888\r\n", "melpomene_ERS074421\t38023\r\n", "melpomene_ERS074422\t37885\r\n", "melpomene_ERS074423\t38333\r\n", "melpomene_ERS074424\t30796\r\n", "melpomene_ERS074425\t33266\r\n", "melpomene_ERS074426\t35757\r\n", "melpomene_ERS074427\t37341\r\n", "melpomene_ERS074428\t32355\r\n", "melpomene_ERS074429\t31705\r\n", "timareta_ERS070245 \t40166\r\n", "timareta_ERS070246 \t38704\r\n", "timareta_ERS070247 \t37990\r\n", "timareta_ERS070248 \t39932\r\n", "timareta_ERS070249 \t40204\r\n", "timareta_ERS070250 \t38882\r\n", "timareta_ERS070251 \t37739\r\n", "\r\n", "\r\n", "## nloci = number of loci with data for exactly ntaxa\r\n", "## ntotal = number of loci for which at least ntaxa have data\r\n", "ntaxa\tnloci\tsaved\tntotal\r\n", "1\t-\r\n", "2\t-\t\t-\r\n", "3\t-\t\t-\r\n", "4\t26561\t*\t119819\r\n", "5\t13321\t*\t93258\r\n", "6\t6556\t*\t79937\r\n", "7\t5141\t*\t73381\r\n", "8\t4286\t*\t68240\r\n", "9\t3795\t*\t63954\r\n", "10\t3316\t*\t60159\r\n", "11\t2938\t*\t56843\r\n", "12\t2831\t*\t53905\r\n", "13\t2541\t*\t51074\r\n", "14\t2294\t*\t48533\r\n", "15\t2163\t*\t46239\r\n", "16\t1941\t*\t44076\r\n", "17\t1816\t*\t42135\r\n", "18\t1740\t*\t40319\r\n", "19\t1691\t*\t38579\r\n", "20\t1663\t*\t36888\r\n", "21\t1568\t*\t35225\r\n", "22\t1426\t*\t33657\r\n", "23\t1412\t*\t32231\r\n", "24\t1368\t*\t30819\r\n", "25\t1170\t*\t29451\r\n", "26\t1217\t*\t28281\r\n", "27\t1130\t*\t27064\r\n", "28\t1122\t*\t25934\r\n", "29\t1032\t*\t24812\r\n", "30\t1028\t*\t23780\r\n", "31\t1011\t*\t22752\r\n", "32\t1003\t*\t21741\r\n", "33\t947\t*\t20738\r\n", "34\t922\t*\t19791\r\n", "35\t881\t*\t18869\r\n", "36\t838\t*\t17988\r\n", "37\t806\t*\t17150\r\n", "38\t811\t*\t16344\r\n", "39\t802\t*\t15533\r\n", "40\t849\t*\t14731\r\n", "41\t800\t*\t13882\r\n", "42\t849\t*\t13082\r\n", "43\t864\t*\t12233\r\n", "44\t902\t*\t11369\r\n", "45\t959\t*\t10467\r\n", "46\t746\t*\t9508\r\n", "47\t794\t*\t8762\r\n", "48\t903\t*\t7968\r\n", "49\t1171\t*\t7065\r\n", "50\t1834\t*\t5894\r\n", "51\t480\t*\t4060\r\n", "52\t611\t*\t3580\r\n", "53\t969\t*\t2969\r\n", "54\t2000\t*\t2000\r\n", "\r\n", "\r\n", "## nvar = number of loci containing n variable sites (pis+autapomorphies).\r\n", "## sumvar = sum of variable sites (SNPs).\r\n", "## pis = number of loci containing n parsimony informative sites.\r\n", "## sumpis = sum of parsimony informative sites.\r\n", "\tnvar\tsumvar\tPIS\tsumPIS\r\n", "0\t10137\t0\t29807\t0\r\n", "1\t9141\t9141\t16094\t16094\r\n", "2\t8326\t25793\t11449\t38992\r\n", "3\t7494\t48275\t8901\t65695\r\n", "4\t7121\t76759\t7571\t95979\r\n", "5\t6775\t110634\t6655\t129254\r\n", "6\t6705\t150864\t5800\t164054\r\n", "7\t6231\t194481\t5399\t201847\r\n", "8\t6035\t242761\t4762\t239943\r\n", "9\t5580\t292981\t4234\t278049\r\n", "10\t5326\t346241\t3797\t316019\r\n", "11\t4909\t400240\t3255\t351824\r\n", "12\t4535\t454660\t2658\t383720\r\n", "13\t4217\t509481\t2376\t414608\r\n", "14\t3864\t563577\t1919\t441474\r\n", "15\t3381\t614292\t1423\t462819\r\n", "16\t3056\t663188\t1085\t480179\r\n", "17\t2665\t708493\t844\t494527\r\n", "18\t2365\t751063\t594\t505219\r\n", "19\t2096\t790887\t407\t512952\r\n", "20\t1780\t826487\t297\t518892\r\n", "21\t1596\t860003\t168\t522420\r\n", "22\t1319\t889021\t129\t525258\r\n", "23\t1131\t915034\t85\t527213\r\n", "24\t912\t936922\t39\t528149\r\n", "25\t706\t954572\t32\t528949\r\n", "26\t563\t969210\t17\t529391\r\n", "27\t465\t981765\t9\t529634\r\n", "28\t394\t992797\t5\t529774\r\n", "29\t277\t1000830\t7\t529977\r\n", "30\t228\t1007670\t1\t530007\r\n", "31\t152\t1012382\t0\t530007\r\n", "32\t118\t1016158\t0\t530007\r\n", "33\t88\t1019062\t0\t530007\r\n", "34\t48\t1020694\t0\t530007\r\n", "35\t28\t1021674\t0\t530007\r\n", "36\t13\t1022142\t0\t530007\r\n", "37\t15\t1022697\t0\t530007\r\n", "38\t7\t1022963\t0\t530007\r\n", "39\t5\t1023158\t0\t530007\r\n", "40\t4\t1023318\t0\t530007\r\n", "41\t3\t1023441\t0\t530007\r\n", "42\t3\t1023567\t0\t530007\r\n", "43\t1\t1023610\t0\t530007\r\n", "44\t1\t1023654\t0\t530007\r\n", "45\t0\t1023654\t0\t530007\r\n", "46\t1\t1023700\t0\t530007\r\n", "47\t0\t1023700\t0\t530007\r\n", "48\t1\t1023748\t0\t530007\r\n", "49\t0\t1023748\t0\t530007\r\n", "50\t1\t1023798\t0\t530007\r\n", "total var= 1023798\r\n", "total pis= 530007\r\n", "sampled unlinked SNPs= 109682\r\n", "sampled unlinked bi-allelic SNPs= 89479\r\n" ] } ], "source": [ "cat empirical_5/stats/empirical_5_m4.stats\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "218875 ## loci with > minsp containing data\n", "212818 ## loci with > minsp containing data & paralogs removed\n", "212818 ## loci with > minsp containing data & paralogs removed & final filtering\n", "\n", "## number of loci recovered in final data set for each taxon.\n", "taxon\tnloci\n", "cydno_ERS074404 \t40114\n", "cydno_ERS074405 \t41446\n", "cydno_ERS074406 \t40560\n", "cydno_ERS074407 \t40694\n", "cydno_ERS074408 \t40758\n", "cydno_ERS074409 \t41320\n", "cydno_ERS074410 \t35272\n", "cydno_ERS074411 \t32803\n", "cydno_ERS074412 \t51705\n", "cydno_ERS074413 \t47291\n", "cydno_ERS074414 \t52449\n", "cydno_ERS074415 \t52917\n" ] } ], "source": [ "%%bash\n", "head -n 20 empirical_5/stats/empirical_5_m2.stats\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Infer ML phylogeny in _raxml_ as an unrooted tree" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "## raxml argumement w/ ...\n", "raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \\\n", " -w /home/deren/Documents/RADmissing/empirical_5/ \\\n", " -n empirical_5_m4 -s empirical_5/outfiles/empirical_5_m4.phy\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "## raxml argumement w/ ...\n", "raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \\\n", " -w /home/deren/Documents/RADmissing/empirical_5/ \\\n", " -n empirical_5_m2 -s empirical_5/outfiles/empirical_5_m2.phy\n", " " ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014.\n", "\n", "With greatly appreciated code contributions by:\n", "Andre Aberer (HITS)\n", "Simon Berger (HITS)\n", "Alexey Kozlov (HITS)\n", "Nick Pattengale (Sandia)\n", "Wayne Pfeiffer (SDSC)\n", "Akifumi S. Tanabe (NRIFS)\n", "David Dao (KIT)\n", "Charlie Taylor (UF)\n", "\n", "\n", "Alignment has 546632 distinct alignment patterns\n", "\n", "Proportion of gaps and completely undetermined characters in this alignment: 70.04%\n", "\n", "RAxML rapid bootstrapping and subsequent ML search\n" ] } ], "source": [ "%%bash \n", "head -n 20 empirical_5/RAxML_info.empirical_5_m4" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014.\n", "\n", "With greatly appreciated code contributions by:\n", "Andre Aberer (HITS)\n", "Simon Berger (HITS)\n", "Alexey Kozlov (HITS)\n", "Nick Pattengale (Sandia)\n", "Wayne Pfeiffer (SDSC)\n", "Akifumi S. Tanabe (NRIFS)\n", "David Dao (KIT)\n", "Charlie Taylor (UF)\n", "\n", "\n", "Alignment has 338778 distinct alignment patterns\n", "\n", "Proportion of gaps and completely undetermined characters in this alignment: 80.98%\n", "\n", "RAxML rapid bootstrapping and subsequent ML search\n" ] } ], "source": [ "%%bash \n", "head -n 20 empirical_5/RAxML_info.empirical_5_m2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get phylo distance (GTRgamma dist)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%load_ext rpy2.ipython" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[1] 0.05029949\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%R\n", "library(ape)\n", "ltre <- read.tree()\n", "mean(cophenetic.phylo(ltre))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }