{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## (📗) ipyrad Cookbook: `abba-baba` admixture tests\n", "\n", "The `ipyrad.analysis` Python module includes functions to calculate abba-baba admixture statistics (including several variants of these measures), to perform signifance tests, and to produce plots of results. All code in this notebook is written in Python, which you can copy/paste into an IPython terminal to execute, or, preferably, run in a Jupyter notebook like this one. See the other analysis cookbooks for [instructions](http://ipyrad.readthedocs.io/analysis.html) on using Jupyter notebooks. All of the software required for this tutorial is included with `ipyrad` (v.6.12+). Finally, we've written functions to generate plots for summarizing and interpreting results. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import ipyparallel as ipp\n", "import toytree\n", "import toyplot" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.7.19\n", "0.16.0-dev\n", "0.1.4\n" ] } ], "source": [ "print ipa.__version__\n", "print toyplot.__version__\n", "print toytree.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connect to cluster\n", "The code can be easily parallelized across cores on your machine, or many nodes of an HPC cluster using the `ipyparallel` library (see our [ipyparallel tutorial]()). An `ipcluster` instance must be started for you to connect to, which can be started by running `'ipcluster start'` in a terminal. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipyclient = ipp.Client()\n", "len(ipyclient)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load in your .loci data file and a tree hypothesis\n", "We are going to use the shape of our tree topology hypothesis to generate 4-taxon tests to perform, therefore we'll start by looking at our tree and making sure it is properly rooted. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "## ipyrad and raxml output files\n", "locifile = \"./analysis-ipyrad/pedic_outfiles/pedic.loci\"\n", "newick = \"./analysis-raxml/RAxML_bipartitions.pedic\"" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
38362_rex39618_rex35236_rex35855_rex40578_rex30556_thamno33413_thamno41954_cyathophylloides41478_cyathophylloides30686_cyathophylla29154_superba33588_przewalskii32082_przewalskiiidx: 13\n", "name: i13\n", "dist: 0.00617527349892\n", "support: 100100idx: 14\n", "name: i14\n", "dist: 0.000738900380519\n", "support: 9696idx: 15\n", "name: i15\n", "dist: 0.00222999650239\n", "support: 100100idx: 16\n", "name: i16\n", "dist: 0.000783365499905\n", "support: 9999idx: 17\n", "name: i17\n", "dist: 0.00103379657491\n", "support: 100100idx: 20\n", "name: i18\n", "dist: 0.00538723112355\n", "support: 100100idx: 18\n", "name: i20\n", "dist: 0.00941020878048\n", "support: 100100idx: 19\n", "name: i21\n", "dist: 0.00237994755604\n", "support: 100100idx: 21\n", "name: i22\n", "dist: 0.00297626149201\n", "support: 100100idx: 22\n", "name: added-node\n", "dist: 0.0358742260975\n", "support: 100100idx: 23\n", "name: i19\n", "dist: 0.0358742260975\n", "support: 100100
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "## parse the newick tree, re-root it, and plot it.\n", "tre = toytree.tree(newick=newick)\n", "tre.root(wildcard=\"prz\")\n", "tre.draw(\n", " height=350,\n", " width=400,\n", " node_labels=tre.get_node_values(\"support\")\n", " )\n", "\n", "## store rooted tree back into a newick string.\n", "newick = tre.tree.write()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short tutorial: calculating abba-baba statistics\n", "To give a gist of what this code can do, here is a quick tutorial version, each step of which we explain in greater detail below. We first create a `'baba'` analysis object that is linked to our data file, in this example we name the variable **bb**. Then we tell it which tests to perform, here by automatically generating a number of tests using the `generate_tests_from_tree()` function. And finally, we calculate the results and plot them. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "## create a baba object linked to a data file and newick tree\n", "bb = ipa.baba(data=locifile, newick=newick)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "44 tests generated from tree\n" ] } ], "source": [ "## generate all possible abba-baba tests meeting a set of constraints\n", "bb.generate_tests_from_tree(\n", " constraint_dict={\n", " \"p4\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"p3\": [\"33413_thamno\"],\n", " })" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'p1': ['41478_cyathophylloides'],\n", " 'p2': ['29154_superba', '30686_cyathophylla'],\n", " 'p3': ['33413_thamno'],\n", " 'p4': ['32082_przewalskii', '33588_przewalskii']},\n", " {'p1': ['41954_cyathophylloides'],\n", " 'p2': ['29154_superba', '30686_cyathophylla'],\n", " 'p3': ['33413_thamno'],\n", " 'p4': ['32082_przewalskii', '33588_przewalskii']},\n", " {'p1': ['41478_cyathophylloides'],\n", " 'p2': ['29154_superba'],\n", " 'p3': ['33413_thamno'],\n", " 'p4': ['32082_przewalskii', '33588_przewalskii']}]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## show the first 3 tests\n", "bb.tests[:3]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% calculating D-stats | 0:02:58 | \n" ] } ], "source": [ "## run all tests linked to bb \n", "bb.run(ipyclient)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dstatbootmeanbootstdZABBABABAnloci
00.0890.0890.0362.485436.125365.0008721
10.0960.0980.0372.640425.062350.2508267
20.1010.1020.0442.301329.938269.3756573
30.1140.1140.0432.623319.312254.1256255
40.1240.1240.0393.188400.250312.1888026
\n", "
" ], "text/plain": [ " dstat bootmean bootstd Z ABBA BABA nloci\n", "0 0.089 0.089 0.036 2.485 436.125 365.000 8721\n", "1 0.096 0.098 0.037 2.640 425.062 350.250 8267\n", "2 0.101 0.102 0.044 2.301 329.938 269.375 6573\n", "3 0.114 0.114 0.043 2.623 319.312 254.125 6255\n", "4 0.124 0.124 0.039 3.188 400.250 312.188 8026" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## show first 5 results\n", "bb.results_table.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Look at the results\n", "By default we do not attach the names of the samples that were included in each test to the results table since it makes the table much harder to read, and we wanted it to look very clean. However, this information is readily available in the `.test()` attribute of the baba object as shown below. Also, we have made plotting functions to show this information clearly as well. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dstatbootmeanbootstdZABBABABAnloci
160.2900.2900.0309.531606.312333.8128937
150.2390.2380.0288.492608.281373.3659266
170.1990.1990.0326.311550.062367.3129033
190.2040.2050.0336.120545.375360.9388925
200.1600.1610.0305.383499.766362.0479351
\n", "
" ], "text/plain": [ " dstat bootmean bootstd Z ABBA BABA nloci\n", "16 0.290 0.290 0.030 9.531 606.312 333.812 8937\n", "15 0.239 0.238 0.028 8.492 608.281 373.365 9266\n", "17 0.199 0.199 0.032 6.311 550.062 367.312 9033\n", "19 0.204 0.205 0.033 6.120 545.375 360.938 8925\n", "20 0.160 0.161 0.030 5.383 499.766 362.047 9351" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## save all results table to a tab-delimited CSV file\n", "bb.results_table.to_csv(\"bb.abba-baba.csv\", sep=\"\\t\")\n", "\n", "## show the results table sorted by index score (Z)\n", "sorted_results = bb.results_table.sort_values(by=\"Z\", ascending=False)\n", "sorted_results.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
p1p2p3p4
16[35236_rex][30556_thamno][33413_thamno][32082_przewalskii, 33588_przewalskii]
15[35236_rex, 39618_rex, 38362_rex][30556_thamno][33413_thamno][32082_przewalskii, 33588_przewalskii]
17[39618_rex, 38362_rex][30556_thamno][33413_thamno][32082_przewalskii, 33588_przewalskii]
19[38362_rex][30556_thamno][33413_thamno][32082_przewalskii, 33588_przewalskii]
20[35236_rex][40578_rex, 35855_rex][33413_thamno][32082_przewalskii, 33588_przewalskii]
\n", "
" ], "text/plain": [ " p1 p2 p3 \\\n", "16 [35236_rex] [30556_thamno] [33413_thamno] \n", "15 [35236_rex, 39618_rex, 38362_rex] [30556_thamno] [33413_thamno] \n", "17 [39618_rex, 38362_rex] [30556_thamno] [33413_thamno] \n", "19 [38362_rex] [30556_thamno] [33413_thamno] \n", "20 [35236_rex] [40578_rex, 35855_rex] [33413_thamno] \n", "\n", " p4 \n", "16 [32082_przewalskii, 33588_przewalskii] \n", "15 [32082_przewalskii, 33588_przewalskii] \n", "17 [32082_przewalskii, 33588_przewalskii] \n", "19 [32082_przewalskii, 33588_przewalskii] \n", "20 [32082_przewalskii, 33588_przewalskii] " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## get taxon names in the sorted results order\n", "sorted_taxa = bb.taxon_table.iloc[sorted_results.index]\n", "\n", "## show taxon names in the first few sorted tests\n", "sorted_taxa.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting and interpreting results\n", "Interpreting the results of D-statistic tests is actually *very* complicated. You cannot treat every test as if it were independent because introgression between one pair of species may cause one or both of those species to *appear* as if they have also introgressed with other taxa in your data set. This problem is described in great detail in [this paper (Eaton et al. 2015)](http://onlinelibrary.wiley.com/doi/10.1111/evo.12758/abstract). A good place to start, then, is to perform many tests and focus on those which have the strongest signal of admixture. Then, perform additional tests, such as `partitioned D-statistics` (described further below) to tease apart whether a single or multiple introgression events are likely to have occurred. \n", "\n", "In the example plot below we find evidence of admixture between the sample **33413_thamno** (black) with several other samples, but the signal is strongest with respect to **30556_thamno** (tests 12-19). It also appears that admixture is consistently detected with samples of (**40578_rex** & **35855_rex**) when contrasted against **35236_rex** (tests 20, 24, 28, 34, and 35). Take note, the tests are indexed starting at 0." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
01234567891011121314151617181920212223242526272829303132333435363738394041424332082_przewalskii33588_przewalskii29154_superba30686_cyathophylla41478_cyathophylloides41954_cyathophylloides33413_thamno30556_thamno40578_rex35855_rex35236_rex39618_rex38362_rex11.00.0-0.40.4Z-scoresD-statistics
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "## plot results on the tree\n", "bb.plot(height=850, width=700, pct_tree_y=0.2, pct_tree_x=0.5, alpha=4.0);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### generating tests\n", "Because tests are generated based on a tree file, it will only generate tests that fit the topology of the test. For example, the entries below generate zero possible tests because the two samples entered for P3 (the two thamnophila subspecies) are paraphyletic on the tree topology, and therefore cannot form a clade together. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 tests generated from tree\n" ] } ], "source": [ "## this is expected to generate zero tests\n", "aa = bb.copy()\n", "aa.generate_tests_from_tree(\n", " constraint_dict={\n", " \"p4\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"p3\": [\"33413_thamno\", \"30556_thamno\"],\n", " })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to get results for a test that does not fit on your tree you can always write the result out by hand instead of auto-generating it from the tree. Doing it this way is fine when you have few tests to run, but becomes burdensome when writing many tests. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% calculating D-stats | 0:00:23 | \n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dstatbootmeanbootstdZABBABABAnloci
00.0500.0500.0222.291939.172850.50015820
10.1630.1630.0217.900984.953708.79715576
\n", "
" ], "text/plain": [ " dstat bootmean bootstd Z ABBA BABA nloci\n", "0 0.050 0.050 0.022 2.291 939.172 850.500 15820\n", "1 0.163 0.163 0.021 7.900 984.953 708.797 15576" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## writing tests by hand for a new object\n", "aa = bb.copy()\n", "aa.tests = [\n", " {\"p4\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"p3\": [\"33413_thamno\", \"30556_thamno\"],\n", " \"p2\": [\"40578_rex\", \"35855_rex\"],\n", " \"p1\": [\"39618_rex\", \"38362_rex\"]},\n", " {\"p4\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"p3\": [\"33413_thamno\", \"30556_thamno\"],\n", " \"p2\": [\"40578_rex\", \"35855_rex\"],\n", " \"p1\": [\"35236_rex\"]}, \n", " ]\n", "## run the tests\n", "aa.run(ipyclient)\n", "aa.results_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Further investigating results with 5-part tests\n", "You can also perform partitioned D-statistic tests like below. Here we are testing the direction of introgression. If the two *thamnophila* subspecies are in fact sister species then they would be expected to share derived alleles that arose in their ancestor and which would be introduced from together if either one of them introgressed into a *P. rex* taxon. As you can see, test 0 shows no evidence of introgression, whereas test 1 shows that the two *thamno* subspecies share introgressed alleles that are present in two samples of *rex* relative to sample \"35236_rex\". \n", "\n", "More on this further below in this notebook. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% calculating D-stats | 0:00:23 | \n" ] } ], "source": [ "## further investigate with a 5-part test\n", "cc = bb.copy()\n", "cc.tests = [\n", " {\"p5\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"p4\": [\"33413_thamno\"],\n", " \"p3\": [\"30556_thamno\"],\n", " \"p2\": [\"40578_rex\", \"35855_rex\"],\n", " \"p1\": [\"39618_rex\", \"38362_rex\"]},\n", " {\"p5\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"p4\": [\"33413_thamno\"],\n", " \"p3\": [\"30556_thamno\"],\n", " \"p2\": [\"40578_rex\", \"35855_rex\"],\n", " \"p1\": [\"35236_rex\"]}, \n", " ]\n", "cc.run(ipyclient)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DstatbootmeanbootstdZABxxABAxxAnloci
0p3-0.037-0.0350.0410.885230.852248.3528933
p40.0440.0440.0530.840160.125146.5318933
shared0.0200.0200.0250.801449.754431.8958933
1p30.1760.1780.0463.862252.953177.1098840
p40.1350.1340.0522.612159.172121.2668840
shared0.1770.1770.0257.060514.859359.7038840
\n", "
" ], "text/plain": [ " Dstat bootmean bootstd Z ABxxA BAxxA nloci\n", "0 p3 -0.037 -0.035 0.041 0.885 230.852 248.352 8933\n", " p4 0.044 0.044 0.053 0.840 160.125 146.531 8933\n", " shared 0.020 0.020 0.025 0.801 449.754 431.895 8933\n", "1 p3 0.176 0.178 0.046 3.862 252.953 177.109 8840\n", " p4 0.135 0.134 0.052 2.612 159.172 121.266 8840\n", " shared 0.177 0.177 0.025 7.060 514.859 359.703 8840" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## the partitioned D results for two tests\n", "cc.results_table" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
p1p2p3p4p5
0[39618_rex, 38362_rex][40578_rex, 35855_rex][30556_thamno][33413_thamno][32082_przewalskii, 33588_przewalskii]
1[35236_rex][40578_rex, 35855_rex][30556_thamno][33413_thamno][32082_przewalskii, 33588_przewalskii]
\n", "
" ], "text/plain": [ " p1 p2 p3 \\\n", "0 [39618_rex, 38362_rex] [40578_rex, 35855_rex] [30556_thamno] \n", "1 [35236_rex] [40578_rex, 35855_rex] [30556_thamno] \n", "\n", " p4 p5 \n", "0 [33413_thamno] [32082_przewalskii, 33588_przewalskii] \n", "1 [33413_thamno] [32082_przewalskii, 33588_przewalskii] " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## and view the 5-part test taxon table\n", "cc.taxon_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full Tutorial\n", "\n", "### Creating a `baba` object\n", "\n", "The fundamental object for running abba-baba tests is the `ipa.baba()` object. This stores all of the information about the data, tests, and results of your analysis, and is used to generate plots. If you only have one data file that you want to run many tests on then you will only need to enter the path to your data once. The data file must be a `'.loci'` file from an ipyrad analysis. In general, you will probably want to use the largest data file possible for these tests (`min_samples_locus`=4), to maximize the amount of data available for any test. Once an initial `baba` object is created you create different copies of that object that will inherit its parameter setttings, and which you can use to perform different tests on, like below. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] } ], "source": [ "## create an initial object linked to your data in 'locifile'\n", "aa = ipa.baba(data=locifile)\n", "\n", "## create two other copies\n", "bb = aa.copy()\n", "cc = aa.copy()\n", "\n", "## print these objects\n", "print aa\n", "print bb\n", "print cc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linking tests to the baba object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next thing we need to do is to link a `'test'` to each of these objects, or a list of tests. In the [Short tutorial](#Short-tutorial:-calculating-abba-baba-statistics) above we auto-generated a list of tests from an input tree, but to be more explicit about how things work we will write out each test by hand here. A test is described by a Python dictionary that tells it which samples (individuals) should represent the 'p1', 'p2', 'p3', and 'p4' taxa in the ABBA-BABA test. You can see in the example below that we set two samples to represent the outgroup taxon (p4). This means that the SNP frequency for those two samples combined will represent the p4 taxon. For the `baba` object named `'cc'` below we enter two tests using a list to show how multiple tests can be linked to a single `baba` object. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "aa.tests = {\n", " \"p4\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"p3\": [\"29154_superba\"], \n", " \"p2\": [\"33413_thamno\"], \n", " \"p1\": [\"40578_rex\"],\n", "}\n", "\n", "bb.tests = {\n", " \"p4\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"p3\": [\"30686_cyathophylla\"], \n", " \"p2\": [\"33413_thamno\"], \n", " \"p1\": [\"40578_rex\"],\n", "}\n", "\n", "cc.tests = [\n", " {\n", " \"p4\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"p3\": [\"41954_cyathophylloides\"], \n", " \"p2\": [\"33413_thamno\"], \n", " \"p1\": [\"40578_rex\"],\n", " },\n", " {\n", " \"p4\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"p3\": [\"41478_cyathophylloides\"], \n", " \"p2\": [\"33413_thamno\"], \n", " \"p1\": [\"40578_rex\"],\n", " },\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other parameters\n", "Each `baba` object has a set of parameters associated with it that are used to filter the loci that will be used in the test and to set some other optional settings. If the `'mincov'` parameter is set to 1 (the default) then loci in the data set will only be used in a test if there is at least one sample from every tip of the tree that has data for that locus. For example, in the tests above where we entered two samples to represent \"p4\" only one of those two samples *needs* to be present for the locus to be included in our analysis. If you want to require that both samples have data at the locus in order for it to be included in the analysis then you could set `mincov=2`. However, for the test above setting `mincov=2` would filter out *all* of the data, since it is impossible to have a coverage of 2 for 'p3', 'p2', and 'p1', since they each have only one sample. Therefore, you can also enter the `mincov` parameter as a dictionary setting a different minimum for each tip taxon, which we demonstrate below for the `baba` object `'bb'`. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "database None \n", "mincov 1 \n", "nboots 1000 \n", "quiet False " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## print params for object aa\n", "aa.params" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "database None \n", "mincov {'p2': 1, 'p3': 1, 'p1': 1, 'p4': 2}\n", "nboots 1000 \n", "quiet False " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## set the mincov value as a dictionary for object bb\n", "bb.params.mincov = {\"p4\":2, \"p3\":1, \"p2\":1, \"p1\":1}\n", "bb.params" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running the tests\n", "When you execute the `'run()'` command all of the tests for the object will be distributed to run in parallel on your cluster (or the cores available on your machine) as connected to your `ipyclient` object. The results of the tests will be stored in your `baba` object under the attributes `'results_table'` and `'results_boots'`. " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% calculating D-stats | 0:00:07 | \n", "[####################] 100% calculating D-stats | 0:00:06 | \n", "[####################] 100% calculating D-stats | 0:00:10 | \n" ] } ], "source": [ "## run tests for each of our objects\n", "aa.run(ipyclient)\n", "bb.run(ipyclient)\n", "cc.run(ipyclient)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The results table\n", "The results of the tests are stored as a data frame (pandas.DataFrame) in `results_table`, which can be easily accessed and manipulated. The tests are listed in order and can be referenced by their `'index'` (the number in the left-most column). For example, below we see the results for object `'cc'` tests 0 and 1. You can see which taxa were used in each test by accessing them from the `.tests` attribute as a dictionary, or as `.taxon_table` which returns it as a dataframe. An even better way to see which individuals were involved in each test, however, is to use our plotting functions, which we describe further below. " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dstatbootmeanbootstdZABBABABAnloci
0-0.007-0.0090.0440.152238.688241.8758313
1-0.008-0.0080.0410.193248.250252.2508822
\n", "
" ], "text/plain": [ " dstat bootmean bootstd Z ABBA BABA nloci\n", "0 -0.007 -0.009 0.044 0.152 238.688 241.875 8313\n", "1 -0.008 -0.008 0.041 0.193 248.250 252.250 8822" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## you can sort the results by Z-score\n", "cc.results_table.sort_values(by=\"Z\", ascending=False)\n", "\n", "## save the table to a file \n", "cc.results_table.to_csv(\"cc.abba-baba.csv\")\n", "\n", "## show the results in notebook\n", "cc.results_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Auto-generating tests\n", "Entering all of the tests by hand can be pain, which is why we wrote functions to auto-generate tests given an input **rooted** tree, and a number of contraints on the tests to generate from that tree. It is important to add constraints on the tests otherwise the number that can be produced becomes very large very quickly. Calculating results runs pretty fast, but summarizing and interpreting thousands of results is pretty much impossible, so it is generally better to limit the tests to those which make some intuitive sense to run. You can see in this example that implementing a few contraints reduces the number of tests from 1608 to 13. " ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2006 tests generated from tree\n", "126 tests generated from tree\n", "14 tests generated from tree\n" ] } ], "source": [ "## create a new 'copy' of your baba object and attach a treefile\n", "dd = bb.copy()\n", "dd.newick = newick\n", "\n", "## generate all possible tests\n", "dd.generate_tests_from_tree()\n", "\n", "## a dict of constraints\n", "constraint_dict={\n", " \"p4\": [\"32082_przewalskii\", \"33588_przewalskii\"],\n", " \"p3\": [\"40578_rex\", \"35855_rex\"],\n", " }\n", "\n", "## generate tests with contraints\n", "dd.generate_tests_from_tree(\n", " constraint_dict=constraint_dict, \n", " constraint_exact=False,\n", ")\n", "\n", "## 'exact' contrainst are even more constrained\n", "dd.generate_tests_from_tree(\n", " constraint_dict=constraint_dict,\n", " constraint_exact=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running the tests\n", "The `.run()` command will run the tests linked to your analysis object. An ipyclient object is required to distribute the jobs in parallel. The `.plot()` function can then optionally be used to visualize the results on a tree. Or, you can simply look at the results in the `.results_table` attribute. " ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% calculating D-stats | 0:01:00 | \n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dstatbootmeanbootstdZABBABABAnloci
00.0710.0710.0342.082415.266360.4069133
10.1200.1210.0353.400421.000330.4848611
20.0850.0880.0412.044327.828276.6096849
30.1290.1290.0442.967326.953252.0476505
40.0960.0970.0372.558376.078310.2668413
50.1350.1350.0383.519380.672290.3597939
6-0.092-0.0900.0402.299278.641335.2346863
7-0.109-0.1090.0372.916310.672386.2978439
8-0.085-0.0830.0441.948276.609327.8286849
9-0.096-0.0960.0382.506310.266376.0788413
10-0.129-0.1300.0433.009252.047326.9536505
11-0.135-0.1340.0383.556290.359380.6727939
12-0.023-0.0230.0320.714435.562455.7508208
13-0.013-0.0140.0300.434509.906523.4389513
\n", "
" ], "text/plain": [ " dstat bootmean bootstd Z ABBA BABA nloci\n", "0 0.071 0.071 0.034 2.082 415.266 360.406 9133\n", "1 0.120 0.121 0.035 3.400 421.000 330.484 8611\n", "2 0.085 0.088 0.041 2.044 327.828 276.609 6849\n", "3 0.129 0.129 0.044 2.967 326.953 252.047 6505\n", "4 0.096 0.097 0.037 2.558 376.078 310.266 8413\n", "5 0.135 0.135 0.038 3.519 380.672 290.359 7939\n", "6 -0.092 -0.090 0.040 2.299 278.641 335.234 6863\n", "7 -0.109 -0.109 0.037 2.916 310.672 386.297 8439\n", "8 -0.085 -0.083 0.044 1.948 276.609 327.828 6849\n", "9 -0.096 -0.096 0.038 2.506 310.266 376.078 8413\n", "10 -0.129 -0.130 0.043 3.009 252.047 326.953 6505\n", "11 -0.135 -0.134 0.038 3.556 290.359 380.672 7939\n", "12 -0.023 -0.023 0.032 0.714 435.562 455.750 8208\n", "13 -0.013 -0.014 0.030 0.434 509.906 523.438 9513" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "
01234567891011121332082_przewalskii33588_przewalskii29154_superba30686_cyathophylla41478_cyathophylloides41954_cyathophylloides33413_thamno30556_thamno40578_rex35855_rex35236_rex39618_rex38362_rex5.00.0-0.30.3Z-scoresD-statistics
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "## run the dd tests\n", "dd.run(ipyclient)\n", "dd.plot(height=500, pct_tree_y=0.2, alpha=4);\n", "dd.results_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More about input file paths (i/o)\n", "The default (required) input data file is the `.loci` file produced by `ipyrad`. When performing D-statistic calculations this file will be parsed to retain the maximal amount of information useful for each test. \n", "\n", "An additional (*optional*) file to provide is a newick tree file. While you do not need a tree in order to run ABBA-BABA tests, you do need at least need *a hypothesis* for how your samples are related in order to setup meaningful tests. By loading in a tree for your data set we can use it to easily set up hypotheses to test, and to plot results on the tree." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## path to a locifile created by ipyrad\n", "locifile = \"./analysis-ipyrad/pedicularis_outfiles/pedicularis.loci\"\n", "\n", "## path to an unrooted tree inferred with tetrad\n", "newick = \"./analysis-tetrad/tutorial.tree\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (optional): root the tree\n", "For abba-baba tests you will pretty much always want your tree to be rooted, since the test relies on an assumption about which alleles are ancestral. You can use our simple tree plotting library `toytree` to root your tree. This library uses [Toyplot](http://toyplot.rtfd.io) as its plotting backend, and [ete3](http://etetoolkit.org/) as its tree manipulation backend. \n", "\n", "Below I load in a newick string and root the tree on the two *P. przewalskii* samples using the `root()` function. You can either enter the names of the outgroup samples explicitly or enter a wildcard to select them. We show the rooted tree from a tetrad analysis below. The newick string of the rooted tree can be saved or accessed by the `.newick` attribute, like below. " ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
38362_rex39618_rex35236_rex35855_rex40578_rex30556_thamno33413_thamno41954_cyathophylloides41478_cyathophylloides30686_cyathophylla29154_superba33588_przewalskii32082_przewalskii
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "## load in the tree\n", "tre = toytree.tree(newick)\n", "\n", "## set the outgroup either as a list or using a wildcard selector\n", "tre.root(outgroup=[\"32082_przewalskii\", \"33588_przewalskii\"])\n", "tre.root(wildcard=\"prz\")\n", "\n", "## draw the tree\n", "tre.draw(width=400)\n", "\n", "## save the rooted newick string back to a variable and print\n", "newick = tre.newick" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpreting results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see in the `results_table` below that the D-statistic range around 0.0-0.15 in these tests. These values are not too terribly informative, and so we instead generally focus on the Z-score representing how far the distribution of D-statistic values across bootstrap replicates deviates from its expected value of zero. The default number of bootstrap replicates to perform per test is 1000. Each replicate resamples nloci with replacement. \n", "\n", "In these tests ABBA and BABA occurred with pretty equal frequency. The values are calculated using SNP frequencies, which is why they are floats instead of integers, and this is also why we were able to combine multiple samples to represent a single tip in the tree (e.g., see the test we setup, above). " ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " dstat bootmean bootstd Z ABBA BABA nloci\n", "0 0.071 0.071 0.034 2.082 415.266 360.406 9133\n", "1 0.120 0.121 0.035 3.400 421.000 330.484 8611\n", "2 0.085 0.088 0.041 2.044 327.828 276.609 6849\n", "3 0.129 0.129 0.044 2.967 326.953 252.047 6505\n", "4 0.096 0.097 0.037 2.558 376.078 310.266 8413\n", "5 0.135 0.135 0.038 3.519 380.672 290.359 7939\n", "6 -0.092 -0.090 0.040 2.299 278.641 335.234 6863\n", "7 -0.109 -0.109 0.037 2.916 310.672 386.297 8439\n", "8 -0.085 -0.083 0.044 1.948 276.609 327.828 6849\n", "9 -0.096 -0.096 0.038 2.506 310.266 376.078 8413\n", "10 -0.129 -0.130 0.043 3.009 252.047 326.953 6505\n", "11 -0.135 -0.134 0.038 3.556 290.359 380.672 7939\n", "12 -0.023 -0.023 0.032 0.714 435.562 455.750 8208\n", "13 -0.013 -0.014 0.030 0.434 509.906 523.438 9513\n" ] } ], "source": [ "## show the results table\n", "print dd.results_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running 5-taxon (partitioned) D-statistics\n", "To perform partitioned D-statistic tests is not any harder than running the standard four-taxon D-statistic tests. You simply enter your tests with 5 taxa in them now, listed as p1-p5. We have not developed a function to generate 5-taxon tests from a phylogeny, as this test is more appropriately applied to a smaller number of tests to further tease apart the meaning of significant 4-taxon results. See example above in the short tutorial. A simulation example will be added here soon..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }