{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ipyrad-analysis toolkit: mrBayes\n", "\n", "Bayesian phylogenetic inference can provide several advantages over ML approaches, particularly with regards to inferring dated trees (separating rate and time), and because they assess support differently, using a posterior sample of trees inferred from the full data set as opposed to bootstrap resampling of the data set. This may be particularly relevant to RAD-seq data that contains missing data. \n", "\n", "Bayesian inference programs often contain waaay too many options, which make them difficult to use. But it's kind of necessary in order to make informed decisions when setting priors on parameters, as opposed to treating the analysis like a black box. That being said, I've written a sort of blackbox tool for running mrbayes analyses. Once you've established a set of parameters that are best for your analysis this tool is useful to then automate it across many loci (e.g., see [`ipa.treeslider()`](https://ipyrad.readthedocs.io/en/latest/API-analysis/cookbook-treeslider.html)). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# conda install ipyrad -c conda-forge -c bioconda\n", "# conda install toytree -c conda-forge\n", "# conda install mrbayes -c conda-forge -c bioconda" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import toytree" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MrBayes, Bayesian Analysis of Phylogeny\r\n", "\r\n", "Version: 3.2.7\r\n", "Features: SSE AVX FMA Beagle readline\r\n", "Host type: x86_64-conda-linux-gnu (CPU: x86_64)\r\n", "Compiler: gnu 7.5.0\r\n" ] } ], "source": [ "# check mrbayes version\n", "! mb -v" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sequence database file" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# the seqs.hdf5 file produced by ipyrad\n", "SEQSFILE = \"/tmp/oaks.seqs.hdf5\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "file already exists\n" ] } ], "source": [ "# download example seqs file if not already present (~500Mb, takes ~5 minutes)\n", "URL = \"https://www.dropbox.com/s/c1u89nwuuv8e6ie/virentes_ref.seqs.hdf5?raw=1\"\n", "ipa.download(URL, path=SEQSFILE);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write a nexus alignment (using window extracter)\n", "This tool will extract RAD loci mapping to the first scaffold and with no missing data across the selected 10 samples and write the sequence data to a nexus file. See the window_extracter docs for more details." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scaffoldstartendsitessnpsmissingsamples
prefilterQrob_Chr010None89074761330.468
postfilterQrob_Chr010None21913625250.008
\n", "
" ], "text/plain": [ " scaffold start end sites snps missing samples\n", "prefilter Qrob_Chr01 0 None 890747 6133 0.46 8\n", "postfilter Qrob_Chr01 0 None 219136 2525 0.00 8" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Wrote data to /tmp/oaks-chr1-aln.nex\n" ] } ], "source": [ "# samples in include\n", "IMAP = {\n", " 'include': [\n", " \"LALC2\", \"FLSF47\", \"FLCK18\", \"CUSV6\",\n", " \"CRL0030\", \"MXED8\", \"BJSB3\", \"AR\"\n", " ],\n", "}\n", "\n", "# init wex tool to select data\n", "wex = ipa.window_extracter(\n", " data=SEQSFILE,\n", " name=\"oaks-chr1-aln\",\n", " workdir=\"/tmp\",\n", " scaffold_idxs=0,\n", " mincov=1.0,\n", " imap=IMAP,\n", ")\n", "\n", "# show stats for this alignment\n", "display(wex.stats)\n", "\n", "# write the alignment file\n", "wex.run(nexus=True, force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting mb param settings\n", "\n", "The purpose of the ipyrad wrapper tool for mrbayes is merely for convenience. You should still take care to understand the priors and parameters that you are selecting by reading the mrbayes documentation. We only support two relatively simple models, a clock and non-clock model with a number of parameters listed in the `.params` attribute of the analysis object. Here I demonstrate the clock model with uncorrelated lognormal distributed rate variation and a birth-death tree prior. This run, which samples 5M iterations takes about 20 hours on 4 cores. " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# the path to your NEXUS formatted file\n", "NEXFILE = \"/tmp/oaks-chr1-aln.nex\"" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# init mrbayes object with input data and (optional) parameter options\n", "mb = ipa.mrbayes(\n", " name=\"oaks-chr1-mb\",\n", " data=NEXFILE,\n", " clock_model=2,\n", " ngen=int(5e6),\n", " samplefreq=int(5e3),\n", " nruns=2,\n", " constraints=\n", ")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "brlenspr clock:uniform \n", "clockratepr normal(0.01,0.005) \n", "clockvarpr igr \n", "igrvarpr exp(10.0) \n", "nchains 4 \n", "ngen 5000000 \n", "nruns 1 \n", "samplefreq 5000 \n", "topologypr uniform " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# modify a param and show all params\n", "mb.params.nruns = 1\n", "mb.params" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#NEXUS\n", "\n", "[log block]\n", "log start filename=/home/deren/Documents/ipyrad/testdocs/analysis/analysis-mb/oaks-chr1-mb.nex.log replace;\n", "\n", "[data block]\n", "execute /tmp/oaks-chr1-aln.nex;\n", "\n", "[tree block]\n", "\n", "\n", "[mb block]\n", "begin mrbayes;\n", " set autoclose=yes nowarn=yes;\n", "\n", " lset nst=6 rates=gamma;\n", "\n", " prset brlenspr=clock:uniform;\n", " prset clockvarpr=igr;\n", " prset igrvarpr=exp(10.0);\n", " prset clockratepr=normal(0.01,0.005);\n", " prset topologypr=uniform;\n", "\n", " \n", "\n", " mcmcp ngen=5000000 nrun=1 nchains=4;\n", " mcmcp relburnin=yes burninfrac=0.25;\n", " mcmcp samplefreq=5000;\n", " mcmcp printfreq=10000 diagnfr=5000;\n", " mcmcp filename=/home/deren/Documents/ipyrad/testdocs/analysis/analysis-mb/oaks-chr1-mb.nex;\n", " mcmc;\n", "\n", " sump filename=/home/deren/Documents/ipyrad/testdocs/analysis/analysis-mb/oaks-chr1-mb.nex;\n", " sumt filename=/home/deren/Documents/ipyrad/testdocs/analysis/analysis-mb/oaks-chr1-mb.nex contype=allcompat;\n", "end;\n", "\n", "[log block]\n", "log stop filename=/home/deren/Documents/ipyrad/testdocs/analysis/analysis-mb/oaks-chr1-mb.nex.log append;\n", "\n" ] } ], "source": [ "# print the mb nexus string; this can be modified by changing .params\n", "print(mb.nexus_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run mrbayes\n", "\n", "This takes about 20 minutes to run on my laptop for a data set with 13 samples and ~1.2M SNPs and about 14% missing data. For a publication quality analyses you will likely want to run it quite a bit longer and to test for convergence of your runs using a tool like `tracer` and `treeannotator`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "job pedicularis-min10-5M finished successfully\n" ] } ], "source": [ "# run the command, (options: block until finishes; overwrite existing)\n", "mb.run(block=True, force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing results\n", "\n", "You can access the results files of the mrbayes analysis from the mb analysis object from the `.trees` attribute, or of course you can also write out the full path to the result files which will be located in your working directory which can be set as a parameter option, otherwise is defaulted to `\"./analysis-mb/\".`" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "constre ~/Documents/ipyrad/newdocs/API-analysis/analysis-mb/pedicularis-min10-5M.nex.con.tre\n", "info ~/Documents/ipyrad/newdocs/API-analysis/analysis-mb/pedicularis-min10-5M.nex.lstat\n", "posttre ~/Documents/ipyrad/newdocs/API-analysis/analysis-mb/pedicularis-min10-5M.nex.t" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# you can access the tree results from the mb analysis object\n", "mb.trees" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-mb/pedicularis-min10-5M.nex.con.tre'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# for example to select the consensus tree path\n", "mb.trees.constre" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check the parameter values and convergence\n", "\n", "On this very large data set (in terms of number of sites) we can see that the posterior has not explored the parameter space very well by the end of this run. We would hope to see the ESS score for all parameters above 100. Here the TH, TL, net_speciation, and clockrate parameters have low convergence scores. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MeanVarianceLowerUpperMedianESS
Parameter
TH0.0550.0000.0450.0670.0544.014
TL0.4430.0050.3280.5640.4384.031
r(A<->C)0.1020.0000.1000.1050.102325.813
r(A<->G)0.2860.0000.2830.2890.286267.213
r(A<->T)0.1120.0000.1090.1140.112341.615
r(C<->G)0.1020.0000.1000.1050.102376.000
r(C<->T)0.2960.0000.2930.2990.296376.000
r(G<->T)0.1020.0000.1000.1040.102376.000
pi(A)0.2970.0000.2960.2980.297238.441
pi(C)0.2070.0000.2060.2070.207331.586
pi(G)0.2080.0000.2070.2090.208288.870
pi(T)0.2880.0000.2880.2890.288239.621
alpha0.0180.0000.0000.0380.018182.125
net_speciation0.4060.0160.1890.6210.40010.388
relative_extinction0.0100.0000.0000.0240.009367.141
tk02var13.4757.8378.49518.63213.420220.202
clockrate0.0030.0000.0020.0050.00314.291
\n", "
" ], "text/plain": [ " Mean Variance Lower Upper Median ESS\n", "Parameter \n", "TH 0.055 0.000 0.045 0.067 0.054 4.014\n", "TL 0.443 0.005 0.328 0.564 0.438 4.031\n", "r(A<->C) 0.102 0.000 0.100 0.105 0.102 325.813\n", "r(A<->G) 0.286 0.000 0.283 0.289 0.286 267.213\n", "r(A<->T) 0.112 0.000 0.109 0.114 0.112 341.615\n", "r(C<->G) 0.102 0.000 0.100 0.105 0.102 376.000\n", "r(C<->T) 0.296 0.000 0.293 0.299 0.296 376.000\n", "r(G<->T) 0.102 0.000 0.100 0.104 0.102 376.000\n", "pi(A) 0.297 0.000 0.296 0.298 0.297 238.441\n", "pi(C) 0.207 0.000 0.206 0.207 0.207 331.586\n", "pi(G) 0.208 0.000 0.207 0.209 0.208 288.870\n", "pi(T) 0.288 0.000 0.288 0.289 0.288 239.621\n", "alpha 0.018 0.000 0.000 0.038 0.018 182.125\n", "net_speciation 0.406 0.016 0.189 0.621 0.400 10.388\n", "relative_extinction 0.010 0.000 0.000 0.024 0.009 367.141\n", "tk02var 13.475 7.837 8.495 18.632 13.420 220.202\n", "clockrate 0.003 0.000 0.002 0.005 0.003 14.291" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show the convergence statistics (from the .pstat mb output file)\n", "mb.convergence_stats.round(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting mb consesus tree\n", "You can plot either the consensus tree (`.nex.cons.tre`) or the posterior distribution of trees (`.nex.t`) produced by a mrbayes run by loading the single tree or list of trees with `toytree` as demonstrated below. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
39618_rex_SRR175472338362_rex_SRR175472535236_rex_SRR175473140578_rex_SRR175472435855_rex_SRR175472630556_thamno_SRR175472033413_thamno_SRR175472841954_cyathophylloides_SRR175472141478_cyathophylloides_SRR175472230686_cyathophylla_SRR175473029154_superba_SRR175471533588_przewalskii_SRR175472732082_przewalskii_SRR1754729061218
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# load from the .trees attribute or from the saved tree file\n", "tre = toytree.tree(mb.trees.constre, tree_format=10)\n", "\n", "# draw the tree with styling\n", "tre.draw(\n", " tip_labels_align=True, \n", " scalebar=True,\n", " #node_labels=\"support\",\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting mb posterior distribution of trees" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
38362_rex_SRR175472539618_rex_SRR175472335236_rex_SRR175473140578_rex_SRR175472435855_rex_SRR175472630556_thamno_SRR175472033413_thamno_SRR175472841478_cyathophylloides_SRR175472241954_cyathophylloides_SRR175472129154_superba_SRR175471530686_cyathophylla_SRR175473032082_przewalskii_SRR175472933588_przewalskii_SRR175472738362_rex_SRR175472539618_rex_SRR175472335236_rex_SRR175473135855_rex_SRR175472640578_rex_SRR175472430556_thamno_SRR175472033413_thamno_SRR175472829154_superba_SRR175471530686_cyathophylla_SRR175473041478_cyathophylloides_SRR175472241954_cyathophylloides_SRR175472133588_przewalskii_SRR175472732082_przewalskii_SRR175472938362_rex_SRR175472539618_rex_SRR175472335236_rex_SRR175473140578_rex_SRR175472435855_rex_SRR175472630556_thamno_SRR175472033413_thamno_SRR175472830686_cyathophylla_SRR175473029154_superba_SRR175471541954_cyathophylloides_SRR175472141478_cyathophylloides_SRR175472233588_przewalskii_SRR175472732082_przewalskii_SRR175472938362_rex_SRR175472539618_rex_SRR175472335236_rex_SRR175473135855_rex_SRR175472640578_rex_SRR175472430556_thamno_SRR175472033413_thamno_SRR175472829154_superba_SRR175471530686_cyathophylla_SRR175473041478_cyathophylloides_SRR175472241954_cyathophylloides_SRR175472133588_przewalskii_SRR175472732082_przewalskii_SRR17547290.000.020.03
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# load from the .trees attribute or from the saved tree file\n", "mtre = toytree.mtree(mb.trees.posttre)\n", "\n", "# remove the burnin manually\n", "mtre.treelist = mtre.treelist[10:]\n", "\n", "# draw a few trees on a shared axis\n", "mtre.draw_tree_grid(\n", " nrows=1, ncols=4, start=20, shared_axis=True,\n", " height=400, \n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The command below plots a cloud tree diagram across the ~90 trees in the posterior distribution. Because we have no fossil calibrations for this data the root age is not of great interest so I use the `.mod` fuction below to scale the root height of each tree to 1.0. This will highlight variation among trees in relative node heights. Below you can see that in the posterior set of trees there some variation around node heights in the middle of the tree, and there does not appear to be any variation in the topology recovered, which is not uncommon for large concatenated data sets. " ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "# set root height of all trees to 1.0\n", "mtre.treelist = [i.mod.node_scale_root_height(1.0) for i in mtre.treelist]" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
38362_rex39618_rex35236_rex35855_rex40578_rex30556_thamno33413_thamno29154_superba30686_cyathophylla41478_cyathophylloides41954_cyathophylloides33588_przewalskii32082_przewalskii
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# draw the posterior of trees overlapping\n", "mtre.draw_cloud_tree(\n", " width=400, height=300,\n", " html=True,\n", " edge_colors=\"red\",\n", " edge_style={\"stroke-opacity\": 0.02},\n", " tip_labels={i: i.rsplit(\"_\", 1)[0] for i in mtre.treelist[0].get_tip_labels()},\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save figures" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import toyplot.pdf\n", "canvas, axes = mtre.draw_cloud_tree(\n", " width=400, height=300,\n", " html=True,\n", " edge_colors=\"red\",\n", " tip_labels={i: i.rsplit(\"_\", 1)[0] for i in mtre.treelist[0].get_tip_labels()}\n", ");\n", "toyplot.pdf.render(canvas, \"./pedicularis-min10-5M-mb-tree.pdf\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What else?\n", "\n", "If you have reference mapped data then you should see the `.treeslider()` tool to infer trees in sliding windows along scaffolds; or the `.extractor()` tool to extract, filter, and concatenate RAD loci within a given window (e.g., near some known gene)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 2 }