{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

ipyrad-analysis toolkit: treemix

\n", "\n", "The program [TreeMix](https://bitbucket.org/nygcresearch/treemix/wiki/Home) by [Pickrell & Pritchard (2012)](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1002967) is used to infer population splits and admixture from allele frequency data. From the TreeMix documentation: \"In the underlying model, the modern-day populations in a species are related to a common ancestor via a graph of ancestral populations. We use the allele frequencies in the modern populations to infer the structure of this graph.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# conda install treemix ipyrad ipcoal -c conda-forge -c bioconda" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import toytree\n", "import toyplot\n", "import ipcoal" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ipyrad 0.9.54\n", "toytree 2.0.1\n", "TreeMix v. 1.12\r\n" ] } ], "source": [ "print('ipyrad', ipa.__version__)\n", "print('toytree', toytree.__version__)\n", "! treemix --version | grep 'TreeMix v. '" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simulate example data" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
r0r1r2r3r4r5r6
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# network model\n", "tree = toytree.rtree.unittree(7, treeheight=4e6, seed=123)\n", "tree.draw(ts='o', admixture_edges=(3, 2));" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "wrote 1000 SNPs to /tmp/test-treemix.snps.hdf5\n" ] } ], "source": [ "# simulation model\n", "model = ipcoal.Model(tree, Ne=1e4, nsamples=4, admixture_edges=(3, 2, 0.5, 0.2))\n", "model.sim_snps(1000)\n", "model.write_snps_to_hdf5(name=\"test-treemix\", outdir=\"/tmp\", diploid=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Input data file" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [], "source": [ "# the path to your HDF5 formatted snps file\n", "SNPS = \"/tmp/test-treemix.snps.hdf5\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Population assignments" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "IMAP = {\n", " \"r0\": [\"r0-0\", \"r0-1\"],\n", " \"r1\": [\"r1-0\", \"r1-1\"],\n", " \"r2\": [\"r2-0\", \"r2-1\"],\n", " \"r3\": [\"r3-0\", \"r3-1\"],\n", " \"r4\": [\"r4-0\", \"r4-1\"],\n", " \"r5\": [\"r5-0\", \"r5-1\"],\n", " \"r6\": [\"r6-0\", \"r6-1\"],\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load tool and filter missing data " ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples: 14\n", "Sites before filtering: 1000\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 58\n", "Filtered (mincov): 0\n", "Filtered (minmap): 0\n", "Filtered (subsample invariant): 0\n", "Filtered (combined): 58\n", "Sites after filtering: 942\n", "Sites containing missing values: 0 (0.00%)\n", "Missing values in SNP matrix: 0 (0.00%)\n", "subsampled 942 unlinked SNPs\n" ] } ], "source": [ "tmx = ipa.treemix(SNPS, imap=IMAP, workdir=\"/tmp\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set parameters" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bootstrap 0 \n", "climb 0 \n", "cormig 0 \n", "g (None, None) \n", "global_ 1 \n", "k 0 \n", "m 1 \n", "noss 0 \n", "root r4,r5,r6 \n", "se 0 \n", "seed 381302990 " ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tmx.params.root = \"r4,r5,r6\"\n", "tmx.params.m = 1\n", "tmx.params.global_ = 1\n", "tmx.params" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run analysis" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/home/deren/miniconda3/envs/py36/bin/treemix -i /tmp/test.treemix.in.gz -o /tmp/test -m 1 -seed 381302990 -root r4,r5,r6 -global'" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the command that will be run\n", "tmx.command" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [], "source": [ "# execute command\n", "tmx.run()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parse results\n", "The result here is not accurate. Perhaps it would improve with more samples per lineage or more SNPs. " ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "admixture [(4, 0.123026, 7, 0.0657704, 0.039121)]\n", "cov [[ 0.193246 0.01681 0.0151671 -0.0440155 -0.0540499 -0.0633512\n", " -0.0638062 ]\n", " [ 0.01681 0.169992 0.0569728 -0.0510421 -0.0578917 -0.0671931\n", " -0.0676481 ]\n", " [ 0.0151671 0.0569728 0.178207 -0.051358 -0.0613924 -0.0685706\n", " -0.0690256 ]\n", " [-0.0440155 -0.0510421 -0.051358 0.156445 0.00681355 -0.00779569\n", " -0.00904682]\n", " [-0.0540499 -0.0578917 -0.0613924 0.00681355 0.118506 0.0246328\n", " 0.0233816 ]\n", " [-0.0633512 -0.0671931 -0.0685706 -0.00779569 0.0246328 0.114058\n", " 0.0682204 ]\n", " [-0.0638062 -0.0676481 -0.0690256 -0.00904682 0.0233816 0.0682204\n", " 0.117925 ]]\n", "llik 125.367 \n", "tree (r6:0.0853888,((r3:0.140711,(r2:0.0776395,(r1:0.0450637,r0:0.0504777):0.0657704):0.0430514):0.127177,(r5:0.111235,r4:0.123026):0.0474548):0.0853888);" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tmx.results" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
r0r1r2r3r4r5r60.00.20.4
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
0.1932460.0168100.015167-0.044015-0.054050-0.063351-0.063806r60.0168100.1699920.056973-0.051042-0.057892-0.067193-0.067648r50.0151670.0569730.178207-0.051358-0.061392-0.068571-0.069026r4-0.044015-0.051042-0.0513580.1564450.006814-0.007796-0.009047r3-0.054050-0.057892-0.0613920.0068140.1185060.0246330.023382r2-0.063351-0.067193-0.068571-0.0077960.0246330.1140580.068220r1-0.063806-0.067648-0.069026-0.0090470.0233820.0682200.117925r0r6r5r4r3r2r1r0
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "canvas1, axes1 = tmx.draw_tree();\n", "canvas2, axes2 = tmx.draw_cov();" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "# save your plots\n", "import toyplot.svg\n", "toyplot.svg.render(canvas1, \"/tmp/treemix-m1.svg\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding the best value for `m`\n", "\n", "As with structure plots there is no True best value, but you can use model selection methods to decide whether one is a statistically better fit to your data than another. Adding additional admixture edges will always improve the likelihood score, but with diminishing returns as you add additional edges that explain little variation in the data. You can look at the log likelihood score of each model fit by running a for-loop like below. You may want to run this within another for-loop that iterates over different subsampled SNPs. " ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [], "source": [ "tests = {}\n", "nadmix = [0, 1, 2, 3, 4, 5]\n", "\n", "# iterate over n admixture edges and store results in a dictionary\n", "for adm in nadmix:\n", " tmx.params.m = adm\n", " tmx.run()\n", " tests[adm] = tmx.results.llik" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
012345n admixture edges124.5125.0125.5ln(likelihood)
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot the likelihood for different values of m\n", "toyplot.plot(\n", " nadmix,\n", " [tests[i] for i in nadmix],\n", " width=350, \n", " height=275,\n", " stroke_width=3,\n", " xlabel=\"n admixture edges\",\n", " ylabel=\"ln(likelihood)\",\n", ");" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }