{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ipyrad-analysis toolkit: treemix\n", "\n", "[View as notebook](https://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/newdocs/API-analysis/cookbook-treemix.ipynb)\n", "\n", "The program [TreeMix](https://bitbucket.org/nygcresearch/treemix/wiki/Home) by [Pickrell & Pritchard (2012)](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1002967) is used to infer population splits and admixture from allele frequency data. From the TreeMix documentation: \"In the underlying model, the modern-day populations in a species are related to a common ancestor via a graph of ancestral populations. We use the allele frequencies in the modern populations to infer the structure of this graph.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software\n", "A minor detail of the `treemix` conda installation is that it installs a bunch of junk alongside it, including R and openblas, into your conda environment. It's a ton of bloat that in my experience has a high chance of causing incompatibilities with other packages in your conda installation eventually. For this reason I recommend installing it into a separate conda environment. \n", "\n", "The installation instructions below can be used to setup a new environment called treemix that you can use for this analyis. This will avoid the installation from conflicting with any of your other software. When you are done you can switch back to your default environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# conda init\n", "# conda create -n treemix\n", "# conda activate treemix\n", "# conda install treemix -c bioconda -c conda-forge\n", "# conda install ipyrad -c bioconda -c conda-forge\n", "# conda install jupyter -c conda-forge\n", "# conda install toytree -c eaton-lab -c conda-forge\n", "# jupyter-notebook" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import toytree\n", "import toyplot" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ipyrad 0.9.15\n", "toytree 0.2.3\n", "TreeMix v. 1.13\r\n" ] } ], "source": [ "print('ipyrad', ipa.__version__)\n", "print('toytree', toytree.__version__)\n", "! treemix --version | grep 'TreeMix v. '" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required input data files\n", "Your input data should be a `.snps.hdf` database file produced by ipyrad. If you do not have this you can generate it from any VCF file following the [vcf2hdf5 tool tutorial](https://ipyrad.readthedocs.io/en/latest/API-analysis/cookbook-vcf2hdf5.html). The database file contains the genotype calls information as well as linkage information that is used for subsampling unlinked SNPs and bootstrap resampling." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# the path to your HDF5 formatted snps file\n", "data = \"/home/deren/Downloads/ref_pop2.snps.hdf5\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Short Tutorial:\n", "\n", "If you entered population information during data assembly then you may have already produced a `.treemix.gz` output file that can be used as input to the treemix command line program. Alternatively, you can run treemix using the ipyrad tool here which offers some additional flexibility for filtering SNP data, and for running treemix programatically over many parameter settings. \n", "\n", "The key features offered by `ipa.treemix` include: \n", "\n", "1. Filter unlinked SNPs (1 per locus) many times for replicate analyses.\n", "2. Filter by sample or populations coverage.\n", "3. Plotting functions. \n", "4. Easy to write for-loops " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# group individuals into populations\n", "imap = {\n", " \"virg\": [\"TXWV2\", \"LALC2\", \"SCCU3\", \"FLSF33\", \"FLBA140\"],\n", " \"mini\": [\"FLSF47\", \"FLMO62\", \"FLSA185\", \"FLCK216\"],\n", " \"gemi\": [\"FLCK18\", \"FLSF54\", \"FLWO6\", \"FLAB109\"],\n", " \"bran\": [\"BJSL25\", \"BJSB3\", \"BJVL19\"],\n", " \"fusi\": [\"MXED8\", \"MXGT4\", \"TXGR3\", \"TXMD3\"],\n", " \"sagr\": [\"CUVN10\", \"CUCA4\", \"CUSV6\", \"CUMM5\"],\n", " \"oleo\": [\"CRL0030\", \"HNDA09\", \"BZBB1\", \"MXSA3017\"],\n", "}\n", "\n", "# minimum % of samples that must be present in each SNP from each group\n", "minmap = {i: 0.5 for i in imap}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples: 28\n", "Sites before filtering: 349914\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 13077\n", "Filtered (mincov): 0\n", "Filtered (minmap): 109383\n", "Filtered (combined): 117718\n", "Sites after filtering: 232196\n", "Sites containing missing values: 221834 (95.54%)\n", "Missing values in SNP matrix: 822320 (12.65%)\n", "subsampled 29923 unlinked SNPs\n" ] } ], "source": [ "# init a treemix analysis object with some param arguments\n", "tmx = ipa.treemix(\n", " data=data, \n", " imap=imap,\n", " minmap=minmap, \n", " seed=123456,\n", " root=\"bran,fusi\",\n", " m=2,\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "treemix -i /home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-treemix/test.treemix.in.gz -o /home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-treemix/test -m 2 -seed 123456 -root bran,fusi\n" ] } ], "source": [ "# print the command string that will be called and run it\n", "print(tmx.command)\n", "tmx.run()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
minigemivirgoleosagrfusibran0.000.050.10
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# draw the resulting tree\n", "tmx.draw_tree();" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
0.0392740.008558-0.010861-0.007422-0.009953-0.010672-0.008925bran0.0085580.017783-0.006984-0.005362-0.003162-0.005733-0.005100fusi-0.010861-0.0069840.0175430.009980-0.003749-0.002483-0.003446sagr-0.007422-0.0053620.0099800.030868-0.009673-0.009907-0.008485oleo-0.009953-0.003162-0.003749-0.0096730.0182450.0049720.003321virg-0.010672-0.005733-0.002483-0.0099070.0049720.0167360.007087gemi-0.008925-0.005100-0.003446-0.0084850.0033210.0070870.015547minibranfusisagroleovirggemimini
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# draw the covariance matrix\n", "tmx.draw_cov();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Finding the best value for `m`\n", "\n", "As with structure plots there is no True best value, but you can use model selection methods to decide whether one is a statistically better fit to your data than another. Adding additional admixture edges will always improve the likelihood score, but with diminishing returns as you add additional edges that explain little variation in the data. You can look at the log likelihood score of each model fit by running a for-loop like below. You may want to run this within another for-loop that iterates over different subsampled SNPs. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples: 28\n", "Sites before filtering: 349914\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 13077\n", "Filtered (mincov): 0\n", "Filtered (minmap): 109383\n", "Filtered (combined): 117718\n", "Sites after filtering: 232196\n", "Sites containing missing values: 221834 (95.54%)\n", "Missing values in SNP matrix: 822320 (12.65%)\n", "subsampled 29923 unlinked SNPs\n" ] } ], "source": [ "# init a treemix analysis object with some param arguments\n", "tmx = ipa.treemix(\n", " data=data, \n", " imap=imap,\n", " minmap=minmap, \n", " seed=1234,\n", " root=\"bran,fusi\",\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "tests = {}\n", "nadmix = [0, 1, 2, 3, 4, 5]\n", "\n", "# iterate over n admixture edges and store results in a dictionary\n", "for adm in nadmix:\n", " tmx.params.m = adm\n", " tmx.run()\n", " tests[adm] = tmx.results.llik" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
012345n admixture edges-1000100200ln(likelihood)
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot the likelihood for different values of m\n", "toyplot.plot(\n", " nadmix,\n", " [tests[i] for i in nadmix],\n", " width=350, \n", " height=275,\n", " stroke_width=3,\n", " xlabel=\"n admixture edges\",\n", " ylabel=\"ln(likelihood)\",\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Iterate over different subsamples of SNPs\n", "\n", "The treemix tool randomly subsamples 1 SNP per locus to reduce the effect of linkage on the results. However, depending on the size of your data set, and the strength of the signal, subsampling may yield slightly different results in different iterations. You can check over different subsampled iterations by re-initing the treemix tool with a different (or no) random seed. Below I plot the results of 9 iterations for m=2. I also use the `global_=True` option here which performs a more thorough (but slower) search." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
gemiminivirgsagroleofusibran0.000.050.11virggemiminioleosagrbranfusi0.000.040.09gemiminivirgoleosagrfusibran0.000.060.13minigemivirgsagroleofusibran0.000.060.11minigemivirgoleosagrbranfusi0.000.050.10virggemiminioleosagrbranfusi0.000.050.10minigemivirgsagroleofusibran0.000.050.10minigemivirgsagroleofusibran0.000.060.12gemiminivirgsagroleofusibran0.000.050.09
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# a gridded canvas to plot trees on \n", "canvas = toyplot.Canvas(width=600, height=700)\n", "\n", "# iterate over multiple set of SNPs\n", "for i in range(9):\n", " \n", " # init a treemix analysis object with a random (no) seed\n", " tmx = ipa.treemix(\n", " data=data, \n", " imap=imap,\n", " minmap=minmap,\n", " root=\"bran,fusi\",\n", " global_=True,\n", " m=2,\n", " quiet=True\n", " )\n", " \n", " # run model fit\n", " tmx.run()\n", " \n", " # select a plot grid axis and add tree to axes\n", " axes = canvas.cartesian(grid=(3, 3, i))\n", " tmx.draw_tree(axes)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
gemiminivirgoleosagrfusibran0.000.050.09minigemivirgoleosagrfusibran0.000.050.11gemiminivirgoleosagrbranfusi0.000.050.10minigemivirgsagroleobranfusi0.000.040.08gemiminivirgoleosagrbranfusi0.000.040.08minigemivirgoleosagrfusibran0.000.060.11gemiminivirgsagroleofusibran0.000.050.10gemivirgminisagroleobranfusi0.000.050.10gemiminivirgoleosagrfusibran0.000.060.13
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# a gridded canvas to plot trees on \n", "canvas = toyplot.Canvas(width=600, height=700)\n", "\n", "# iterate over multiple set of SNPs\n", "for i in range(9):\n", " \n", " # init a treemix analysis object with a random (no) seed\n", " tmx = ipa.treemix(\n", " data=data, \n", " imap=imap,\n", " minmap=minmap,\n", " root=\"bran,fusi\",\n", " global_=True,\n", " m=3,\n", " quiet=True\n", " )\n", " \n", " # run model fit\n", " tmx.run()\n", " \n", " # create a grid axis and add tree to axes\n", " axes = canvas.cartesian(grid=(3, 3, i))\n", " tmx.draw_tree(axes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Save the plot to pdf" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "import toyplot.pdf\n", "toyplot.pdf.render(canvas, \"treemix-m3.pdf\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }