{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# ipyrad-analysis toolkit: abba-baba\n", "\n", "The `baba` tool can be used to measure abba-baba statistics across many different hypotheses on a tree, to easily group individuals into populations for measuring abba-baba using allele frequencies, and to summarize or plot the results of many analyses. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# conda install ipyrad -c conda-forge -c bioconda\n", "# conda install ipcoal -c conda-forge" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import ipcoal\n", "import toytree\n", "import toyplot" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9.57\n", "0.1.4\n", "0.19.0\n", "2.0.3\n" ] } ], "source": [ "print(ipa.__version__)\n", "print(ipcoal.__version__)\n", "print(toyplot.__version__)\n", "print(toytree.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simulate linked SNP data from this tree/network\n", "Here we simulate data on a tree/network that includes an introgressive edge. To approximate RAD-seq data we will also make ~50% of the data missing to approximate dropout from mutation disruption or insufficient sequencing coverage. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
0123456789101112131415161718r0r1r2r3r4r5r6r7r8r90500000010000000
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# generate a balance tree\n", "tree = toytree.rtree.baltree(ntips=10, treeheight=10e6)\n", "\n", "# draw the tree w/ an admixture edge\n", "tree.draw(ts='p', admixture_edges=(2, 3));" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "wrote 64626 SNPs to /tmp/test-baba-miss50.snps.hdf5\n" ] } ], "source": [ "# create a simulation model for this tree/network: (src, dest, time-prop., admix-prop.)\n", "model = ipcoal.Model(tree=tree, nsamples=2, Ne=4e5, admixture_edges=(2, 3, 0.5, 0.15))\n", "\n", "# simulate N loci\n", "model.sim_loci(nloci=3000, nsites=50)\n", "\n", "# drop 50% as missing\n", "model.apply_missing_mask(0.5)\n", "\n", "# write result to a database file\n", "model.write_snps_to_hdf5(name=\"test-baba-miss50\", outdir=\"/tmp\", diploid=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Input data for BABA tests\n", "We need a SNPs database and we need to know the sample names that in that database. As we will show, one easy way to generate tests for a dataset is to use a tree that has the sample labels on the tips (like we created above.) This is optional though, you can also simply write out the tests by hand. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# init a baba tool from your SNPs database\n", "tool = ipa.baba2(\"/tmp/test-baba-miss50.snps.hdf5\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The format of test hypotheses\n", "The simplest way to describe a test is using a dictionary with keys labeled as \"p1\", \"p2\", \"p3\", \"p4\", and the values listing the tip names that will be used to represent a population/species. When multiple samples are listed the pooled allele frequency will be used. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "IMAP1 = {\n", " \"p4\": [\"r7\", \"r6\", \"r5\"],\n", " \"p3\": [\"r3\"],\n", " \"p2\": [\"r2\"],\n", " \"p1\": [\"r0\", \"r1\"],\n", "}\n", "\n", "IMAP2 = {\n", " \"p4\": [\"r7\", \"r6\", \"r5\"],\n", " \"p3\": [\"r3\"],\n", " \"p2\": [\"r1\"],\n", " \"p1\": [\"r0\"],\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A simple tests\n", "Here we will run the first tests (IMAP1) which includes a scenario where we expect to see evidence of admixture since our simulation included gene flow from P3 to P2 backward in time. It is good to start with simple tests before proceeding to run many tests in parallel. The output from a single test is verbose about how SNP filtering is applied, and how many total loci, SNPs, and discordant SNPs are present. \n", "\n", "Running an ABBA-BABA test requires that there is data for at least one sample in each population. But when you have multiple samples per population you can also apply further filtering using a minmap argument. For example, a minmap with values set to 0.75 would require that data is present for 75% of samples in a population for the SNP to be included in analyses. The verbose output below will print how many SNPs are filtered by the minmap and other filters." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples: 7\n", "Sites before filtering: 64626\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 8096\n", "Filtered (mincov): 3565\n", "Filtered (minmap): 29365\n", "Filtered (subsample invariant): 15099\n", "Filtered (combined): 41455\n", "Sites after filtering: 23171\n", "Sites containing missing values: 16291 (70.31%)\n", "Missing values in SNP matrix: 24400 (15.04%)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DbootstdZABBABABAnSNPsnloci
00.5330.03216.776551.667167.833231711566
\n", "
" ], "text/plain": [ " D bootstd Z ABBA BABA nSNPs nloci\n", "0 0.533 0.032 16.776 551.667 167.833 23171 1566" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# run a single test \n", "tool.run_test(IMAP1)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples: 7\n", "Sites before filtering: 64626\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 8096\n", "Filtered (mincov): 3565\n", "Filtered (minmap): 34239\n", "Filtered (subsample invariant): 15099\n", "Filtered (combined): 44594\n", "Sites after filtering: 20032\n", "Sites containing missing values: 13152 (65.65%)\n", "Missing values in SNP matrix: 16866 (12.03%)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DbootstdZABBABABAnSNPsnloci
00.540.03714.706457.667136.833200321334
\n", "
" ], "text/plain": [ " D bootstd Z ABBA BABA nSNPs nloci\n", "0 0.54 0.037 14.706 457.667 136.833 20032 1334" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# run the same test with different filtering applied\n", "tool.run_test(imap=IMAP1, minmap={i: 0.5 for i in IMAP1})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running multiple tests (simple)\n", "You may wish to run many tests and collect the results in a table. This can be automated and run in parallel using the `.run()` argument. The tests are submitted as a list of dictionaries. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% 0:00:12 | abba-baba tests \n" ] } ], "source": [ "# enter multiple tests (imaps) and multiple minmaps to run in parallel\n", "tool.run(imaps=[IMAP1, IMAP2])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DbootstdZABBABABAnSNPsnloci
00.5330.03415.715551.667167.833231711566
10.1730.0662.640103.58373.042176491280
\n", "
" ], "text/plain": [ " D bootstd Z ABBA BABA nSNPs nloci\n", "0 0.533 0.034 15.715 551.667 167.833 23171 1566\n", "1 0.173 0.066 2.640 103.583 73.042 17649 1280" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show the results\n", "tool.results_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running multiple tests (automated)\n", "Using a tree you can conveniently describe a large set of hypotheses to test that are all compatible with the phylogenetic topology. The list of tests can be further added to by hand if you want to also include tests that are incompatible with the phylogeny. Multiple sample names can be selected by entering the idx node label from the tree drawing (see below)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
0123456789101112131415161718r0r1r2r3r4r5r6r7r8r90500000010000000
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tree.draw(ts='p', admixture_edges=(2, 3));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `.generate_tests_from_tree()` takes the tree as an argument in addition to a `constraint_dict` and a `constraint_exact` argument. This tool will then generate a list of all tests that are compatible with the tree given the constraints put on it. For example, (None, None, None, 13) would only put the constraint that the \"p4\" taxon must descend from node 13. The `constraint_exact=1` on node 13 means that tests must include *all* descendants of node 13. By contrast, \"p3\" is descendant of node 10 but has `constraint_exact=0` meaning it can be node 0, 1, 2, 3, 8, 9 or 10. Still the \"p3\" options will be constrained by the topology such that the sister lineage must include at least two tips. This leads to 10 possible tests below, shown by their idx node labels in the order p1-4." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "12 tests generated from tree\n" ] } ], "source": [ "# generate a list with tests from several constraints\n", "tests1 = tool.generate_tests_from_tree(\n", " tree=tree, \n", " constraint_dict=(None, None, 13, 17), \n", " constraint_exact=(0, 0, 0, 1),\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6 tests generated from tree\n" ] } ], "source": [ "# generate a list with tests from several constraints\n", "tests2 = tool.generate_tests_from_tree(\n", " tree=tree, \n", " constraint_dict=(None, None, 12, 17), \n", " constraint_exact=(0, 0, 0, 1),\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "tests = tests1 + tests2" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# # add more tests manually (tip: get tips desc from node idx using .get_tip_labels())\n", "# tests.append({\n", "# 'p1': tree.get_tip_labels(0),\n", "# 'p2': ['r1', 'r3'], \n", "# 'p3': tree.get_tip_labels(4), \n", "# 'p4': tree.get_tip_labels(13), \n", "# })" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% 0:01:08 | abba-baba tests \n" ] } ], "source": [ "# pass a list with tests to .run() and see results in .results_table\n", "tool.run(tests)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpreting results\n", "The results can be viewed as a pandas DataFrame in (`.results_table`) and saved to a CSV file. The most relevant statistic in the Z score, which represents the number of standard deviations (measured by bootstrap resampling) that D deviates from zero. Values greater than 3 are generally considered significant, but you can lookup Z for given alpha significance level, and you should consider that there are many tests as well when deciding on significance. \n", "\n", "For ease of viewing (maybe) the taxa that are involved in each test are stored in a separate data frame (`.taxon_table`). You can concatenate these two dataframes together before saving if you wish, as you could with any pandas dataframes. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DbootstdZABBABABAnSNPsnloci
00.5780.03118.628512.008136.767277931586
10.5010.03514.347435.863145.029288301611
20.4770.03314.663522.358184.921361631970
30.4520.03811.867407.950153.871279181568
40.1490.0901.65280.93359.896214821294
50.0740.0701.05593.64280.696283011619
60.3000.0446.829246.942133.100211861244
70.6270.03916.228434.29299.600219841291
80.5480.04512.248405.867118.658215831270
90.3460.0418.384328.650159.625276441571
100.0670.0870.77976.95067.237213721284
110.3580.0497.266273.450129.358220821288
12-0.3140.0516.127143.167274.025212241260
130.0210.0560.385131.800126.258217351270
14-0.0070.0510.131161.925164.092275601566
150.1860.0792.35295.08365.317205481281
16-0.1370.0462.982168.461222.025303491639
17-0.0050.0570.096139.200140.725214331255
\n", "
" ], "text/plain": [ " D bootstd Z ABBA BABA nSNPs nloci\n", "0 0.578 0.031 18.628 512.008 136.767 27793 1586\n", "1 0.501 0.035 14.347 435.863 145.029 28830 1611\n", "2 0.477 0.033 14.663 522.358 184.921 36163 1970\n", "3 0.452 0.038 11.867 407.950 153.871 27918 1568\n", "4 0.149 0.090 1.652 80.933 59.896 21482 1294\n", "5 0.074 0.070 1.055 93.642 80.696 28301 1619\n", "6 0.300 0.044 6.829 246.942 133.100 21186 1244\n", "7 0.627 0.039 16.228 434.292 99.600 21984 1291\n", "8 0.548 0.045 12.248 405.867 118.658 21583 1270\n", "9 0.346 0.041 8.384 328.650 159.625 27644 1571\n", "10 0.067 0.087 0.779 76.950 67.237 21372 1284\n", "11 0.358 0.049 7.266 273.450 129.358 22082 1288\n", "12 -0.314 0.051 6.127 143.167 274.025 21224 1260\n", "13 0.021 0.056 0.385 131.800 126.258 21735 1270\n", "14 -0.007 0.051 0.131 161.925 164.092 27560 1566\n", "15 0.186 0.079 2.352 95.083 65.317 20548 1281\n", "16 -0.137 0.046 2.982 168.461 222.025 30349 1639\n", "17 -0.005 0.057 0.096 139.200 140.725 21433 1255" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the results of tests run with .run are summarized in .results_table\n", "tool.results_table" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
p1p2p3p4
0r0,r1r2r3r5,r6,r7,r8,r9
1r0r2r3,r4r5,r6,r7,r8,r9
2r0,r1r2r3,r4r5,r6,r7,r8,r9
3r1r2r3,r4r5,r6,r7,r8,r9
4r0r1r3r5,r6,r7,r8,r9
5r0r1r3,r4r5,r6,r7,r8,r9
6r1r2r4r5,r6,r7,r8,r9
7r0r2r3r5,r6,r7,r8,r9
8r1r2r3r5,r6,r7,r8,r9
9r0,r1r2r4r5,r6,r7,r8,r9
10r0r1r4r5,r6,r7,r8,r9
11r0r2r4r5,r6,r7,r8,r9
12r3r4r2r5,r6,r7,r8,r9
13r3r4r0r5,r6,r7,r8,r9
14r3r4r0,r1r5,r6,r7,r8,r9
15r0r1r2r5,r6,r7,r8,r9
16r3r4r0,r1,r2r5,r6,r7,r8,r9
17r3r4r1r5,r6,r7,r8,r9
\n", "
" ], "text/plain": [ " p1 p2 p3 p4\n", "0 r0,r1 r2 r3 r5,r6,r7,r8,r9\n", "1 r0 r2 r3,r4 r5,r6,r7,r8,r9\n", "2 r0,r1 r2 r3,r4 r5,r6,r7,r8,r9\n", "3 r1 r2 r3,r4 r5,r6,r7,r8,r9\n", "4 r0 r1 r3 r5,r6,r7,r8,r9\n", "5 r0 r1 r3,r4 r5,r6,r7,r8,r9\n", "6 r1 r2 r4 r5,r6,r7,r8,r9\n", "7 r0 r2 r3 r5,r6,r7,r8,r9\n", "8 r1 r2 r3 r5,r6,r7,r8,r9\n", "9 r0,r1 r2 r4 r5,r6,r7,r8,r9\n", "10 r0 r1 r4 r5,r6,r7,r8,r9\n", "11 r0 r2 r4 r5,r6,r7,r8,r9\n", "12 r3 r4 r2 r5,r6,r7,r8,r9\n", "13 r3 r4 r0 r5,r6,r7,r8,r9\n", "14 r3 r4 r0,r1 r5,r6,r7,r8,r9\n", "15 r0 r1 r2 r5,r6,r7,r8,r9\n", "16 r3 r4 r0,r1,r2 r5,r6,r7,r8,r9\n", "17 r3 r4 r1 r5,r6,r7,r8,r9" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the sample names in tests that were run are also summarized in .taxon_table\n", "tool.taxon_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting results\n", "The `.draw()` function takes a tree as input and will organize the results stored to the `baba` tool object into a more easily interpretable diagram. Here the p4, p3, p2, and p1 samples in each tests are indicated by different colors. The histogram on the left shows the distribution of bootstrap D-statistic values in each test and is colored to show the dominant pattern ABBA (orange) or BABA (green) if the test is significant at p=0.01 (Z > 2.5). The data used in this plot is from the `.results_table` and `.taxon_table` shown above." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Z-score: -0.38Z-score: -0.1Z-score: -2.35Z-score: -6.13Z-score: -1.65Z-score: -16.23Z-score: -12.25Z-score: -18.63Z-score: -0.78Z-score: -7.27Z-score: -6.83Z-score: -8.38Z-score: -0.13Z-score: -2.98Z-score: -1.05Z-score: -14.35Z-score: -11.87Z-score: -14.66-20-100Z-scoreD-statistic: 0.02D-statistic: -0.01D-statistic: 0.19D-statistic: -0.31D-statistic: 0.15D-statistic: 0.63D-statistic: 0.55D-statistic: 0.58D-statistic: 0.07D-statistic: 0.36D-statistic: 0.3D-statistic: 0.35D-statistic: -0.01D-statistic: -0.14D-statistic: 0.07D-statistic: 0.5D-statistic: 0.45D-statistic: 0.48-0.50.00.8D-statistic17161514131211109876543210r0r1r2r3r4r5r6r7r8r9r3r4r0r5\n", "r6\n", "r7\n", "r8\n", "r9r3r4r1r5\n", "r6\n", "r7\n", "r8\n", "r9r0r1r2r5\n", "r6\n", "r7\n", "r8\n", "r9r3r4r2r5\n", "r6\n", "r7\n", "r8\n", "r9r0r1r3r5\n", "r6\n", "r7\n", "r8\n", "r9r0r2r3r5\n", "r6\n", "r7\n", "r8\n", "r9r1r2r3r5\n", "r6\n", "r7\n", "r8\n", "r9r0\n", "r1r2r3r5\n", "r6\n", "r7\n", "r8\n", "r9r0r1r4r5\n", "r6\n", "r7\n", "r8\n", "r9r0r2r4r5\n", "r6\n", "r7\n", "r8\n", "r9r1r2r4r5\n", "r6\n", "r7\n", "r8\n", "r9r0\n", "r1r2r4r5\n", "r6\n", "r7\n", "r8\n", "r9r3r4r0\n", "r1r5\n", "r6\n", "r7\n", "r8\n", "r9r3r4r0\n", "r1\n", "r2r5\n", "r6\n", "r7\n", "r8\n", "r9r0r1r3\n", "r4r5\n", "r6\n", "r7\n", "r8\n", "r9r0r2r3\n", "r4r5\n", "r6\n", "r7\n", "r8\n", "r9r1r2r3\n", "r4r5\n", "r6\n", "r7\n", "r8\n", "r9r0\n", "r1r2r3\n", "r4r5\n", "r6\n", "r7\n", "r8\n", "r9
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# generate a summary drawing of test results\n", "canvas = tool.draw(tree=tree, sort=True, width=500, height=500)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# save the drawing to a PDF\n", "import toyplot.pdf\n", "toyplot.pdf.render(canvas, \"/tmp/BABA-plot.pdf\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Non-independence of D-statistics\n", "While ABBA-BABA tests are a powerful tool for testing hypotheses about admixture, they are seriously prone to false positives when interpreting results. This is because each 4-taxon test is non-independent of other tests that involve related taxa ([Eaton et al. 2015](https://onlinelibrary.wiley.com/doi/10.1111/evo.12758)). \n", "\n", "This is clear in the example above, where the true introgressive pattern was \"r3\" into \"r2\" (forward in time). The results show admixture between these samples (e.g., tests 3, 5, 6, 7, 13 above), however, they also pick up a signal of introgression between \"r4\" and \"r2\" (e.g., tests 9,10, 11). This is because 'r3' shares ancestry with 'r4', and so it passed alleles into 'r2' that it shares with 'r4'. Ideally we would want to be able to distinguish whether 'r4' and 'r3' both introgressed into 'r2' independently, or whether it is just a signal of their shared ancestry. This is the goal of \"partitioned D-statistics\" (Eaton et al. 2013, 2015). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running 5-taxon (partitioned) D-statistics\n", "To run a 5-taxon partitioned D-statistic test you must define tests using a dictionary that includes two clades sampled as potential introgressive donors in the P3 clade. These are labeled `p3_1` and `p3_2`. The partitioned test then returns three D-statistics: D12 represents introgression of alleles shared by the two P3 sublineages, D1 is alleles that uniquely introgressed from P3_1 and D2 is alleles that uniquely introgressed from P3_2. The D12 statistic is particularly powerful since it indicates the direction of introgression. **If a test does not have a significant D12 then introgression from that P3 lineage is not supported**. You should try testing in the opposite direction, as in the example below. \n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "tool = ipa.baba2(\"/tmp/test-baba-miss50.snps.hdf5\")" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# testing r3 vs r4 as an introgressive donor into r2 or r0,r1\n", "IMAP5 = {\n", " \"p1\": [\"r0\", \"r1\"],\n", " \"p2\": [\"r2\"],\n", " \"p3_1\": [\"r3\"],\n", " \"p3_2\": [\"r4\"],\n", " \"p4\": ['r5', 'r6', 'r7'],\n", "}\n", "\n", "# the reverse test of 'r2' versus 'r0,r1' as introgressive donors into r3 or r4\n", "IMAP5_REV = {\n", " \"p1\": [\"r3\"],\n", " \"p2\": [\"r4\"],\n", " \"p3_1\": [\"r0\", \"r1\"],\n", " \"p3_2\": [\"r2\"],\n", " \"p4\": ['r5', 'r6', 'r7'],\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Partitioned D-statistic results\n", "The first test (IMAP5) has a significant Z12 with many more ABBBA patterns than BABBA, supporting P3 as an introgressive donor into P2. Of the two P3 lineages, we see that there is significant support for introgression of P3_1 into P2 (Z1) but not of P3_2 into P2 (Z2). Thus we know that introgression occurred from P3_1 and not P3_2 into P2. " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
D12D1D2boot12stdboot1stdboot2stdZ12Z1Z2ABBBABABBAABBAABABAAABABABAABAnSNPsnloci
00.4130.6210.0810.0480.0550.0848.6311.1980.964212.088.167179.41742.059.58350.667190671173
\n", "
" ], "text/plain": [ " D12 D1 D2 boot12std boot1std boot2std Z12 Z1 Z2 \\\n", "0 0.413 0.621 0.081 0.048 0.055 0.084 8.63 11.198 0.964 \n", "\n", " ABBBA BABBA ABBAA BABAA ABABA BAABA nSNPs nloci \n", "0 212.0 88.167 179.417 42.0 59.583 50.667 19067 1173 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# significant Z12 and Z1 \n", "tool.run_partitioned_test(IMAP5, quiet=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ability for partitioned D-statistics to show the direction of introgression is clearly demonstrated in the reverse test. We know that in our simulation introgression occured from `r3` into `r2` (forward in time), but the test below (IMAP5_REV) is asking about the reverse pattern. As we saw above, the 4-taxon tests were not able to clearly distinguish the direction of introgression. However, here we can see that there *not a significant Z12 statistic*, meaning that this direction of introgression is not supported. The significant Z2 statistic shows that the P3_2 taxon shares many alleles with P2 (there is admixture) but the lack of significant Z12 means it is not supported as an introgressive donor." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
D12D1D2boot12stdboot1stdboot2stdZ12Z1Z2ABBBABABBAABBAABABAAABABABAABAnSNPsnloci
00.0210.094-0.5010.0640.0930.0560.3221.0089.033104.5100.2550.66742.059.583179.417190671173
\n", "
" ], "text/plain": [ " D12 D1 D2 boot12std boot1std boot2std Z12 Z1 Z2 \\\n", "0 0.021 0.094 -0.501 0.064 0.093 0.056 0.322 1.008 9.033 \n", "\n", " ABBBA BABBA ABBAA BABAA ABABA BAABA nSNPs nloci \n", "0 104.5 100.25 50.667 42.0 59.583 179.417 19067 1173 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# non-significant Z12 means that the hypothesis is not supported\n", "tool.run_partitioned_test(IMAP5_REV, quiet=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Partitioned D-tests versus Dfoil...\n", "You may have seen a paper in 2015 criticizing the partitioned D-statistics and presenting a modification to this method named D-foil. I do not agree with the criticism in this paper of the partitioned tests. The paper suffered major problems with its simulation framework (e.g., ms simulations with gene flow in the wrong direction, migration rates >50%, etc.) that made the critique of partitioned D-tests unrealistic and artificial, in my opinion." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 1 }