{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ipyrad-analysis toolkit: tetrad\n", "\n", "`tetrad` is a species tree inference tool based on the SVDQuartets algorithm of Chifman and Kubatko. It uses the theory of phylogenetic invariants to resolve quartet trees from SNPs for all sets of quartets in a larger tree, and then joins the quartets together into a supertree using QMC. Here I demonstrate how to call tetrad from the ipyrad-analysis tools, which is convenient for keeping all of your analyses in jupyter notebooks. Alternatively, you could also call tetrad as a command-line tool (see the [tetrad docs](https://github.com/eaton-lab/tetrad)). \n", "\n", "The following features of `tetrad` are highlighted:\n", "\n", "**Parallelization**: tetrad can be massively parallelized on an HPC cluster. It approximately scales linearly with the number of available cores. Using the flexible ipyparallel backend, you can even parallelize over multiple nodes using MPI ([more details coming soon]). \n", "\n", "**Bootstrap sampling**: In contrast to SVDquartets, tetrad is designed particularly to work well with RAD-seq data, though it can apply to any SNP data set. Bootstrap replicates resample the number of loci with replacement to the same size as the original data set. \n", "\n", "**SNP subsampling**: The underlying model assumes that the examined SNPs are unlinked, and tetrad uses a very efficient method to maximize the number of unlinked SNPs used in the analysis. A very crude way to achieve unlinked SNPs would be to sample one SNP per locus before beginning the analysis. However, a site that is variable in the context of all your samples is not necessarily informative for any given quartet of samples. Instead, tetrad subsamples a single SNP from every locus separately for every quartet set in the analysis, and repeats this independently in every bootstrap replicate. This way the maximum amount of unlinked SNP information is used in every quartet inference. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# conda install ipyrad -c bioconda\n", "# conda install tetrad -c eaton-lab -c conda-forge" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import toytree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Input data\n", "\n", "The input data file should be the `.snps.hdf5` file produced by ipyrad (v.0.9.13 or newer). If you did not assemble your data in ipyrad then you can convert your SNPs data to this format from any VCF using the [vcf_to_hdf5 tool](https://ipyrad.readthedocs.io/en/latest/API-analysis/cookbook-vcf2hdf5.html). " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# the path to your sequence data in HDF5 format\n", "data = \"/home/deren/Documents/virentes-reference/analysis-ipyrad/ref_min4_outfiles/ref_min4.snps.hdf5\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize the analysis object\n", "Here you can enter parameters for the analysis. By default only a subsample of the total quartets will be inferred. To instead infer all quartets simply enter a very large number for the `nquartets` parameter and it will use the maximum. When you initialize the object it will print the size of your dataset, the number of loci, and the number of quartets. The number of loci is of interest because this is the maximum number of SNPs that will be used in an analysis. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading snps array [37 taxa x 1182005 snps]\n", "max unlinked SNPs per quartet [nloci]: 88938\n", "quartet sampler [full]: 66045 / 66045\n" ] } ], "source": [ "# init analysis object with input data and (optional) parameter options\n", "tet = ipa.tetrad(\n", " name=\"virentes-min4\",\n", " data=data,\n", " nquartets=1e6,\n", " nboots=16,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Call run" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parallel connection | d9118d19223a: 40 cores\n", "initializing quartet sets database\n", "[####################] 100% 0:01:48 | inferring full tree * | mean SNPs/quartet: 78381 \n", "[####################] 100% 0:01:32 | bootstrap inference 1 | mean SNPs/quartet: 77664 \n", "[####################] 100% 0:01:33 | bootstrap inference 2 | mean SNPs/quartet: 78121 \n", "[####################] 100% 0:01:33 | bootstrap inference 3 | mean SNPs/quartet: 78356 \n", "[####################] 100% 0:01:33 | bootstrap inference 4 | mean SNPs/quartet: 78399 \n", "[####################] 100% 0:01:32 | bootstrap inference 5 | mean SNPs/quartet: 77797 \n", "[####################] 100% 0:01:33 | bootstrap inference 6 | mean SNPs/quartet: 78836 \n", "[####################] 100% 0:01:32 | bootstrap inference 7 | mean SNPs/quartet: 78550 \n", "[####################] 100% 0:01:32 | bootstrap inference 8 | mean SNPs/quartet: 77943 \n", "[####################] 100% 0:01:32 | bootstrap inference 9 | mean SNPs/quartet: 77753 \n", "[####################] 100% 0:01:33 | bootstrap inference 10 | mean SNPs/quartet: 78646 \n", "[####################] 100% 0:01:33 | bootstrap inference 11 | mean SNPs/quartet: 78200 \n", "[####################] 100% 0:01:32 | bootstrap inference 12 | mean SNPs/quartet: 78073 \n", "[####################] 100% 0:01:32 | bootstrap inference 13 | mean SNPs/quartet: 78056 \n", "[####################] 100% 0:01:33 | bootstrap inference 14 | mean SNPs/quartet: 78491 \n", "[####################] 100% 0:01:32 | bootstrap inference 15 | mean SNPs/quartet: 78709 \n", "[####################] 100% 0:01:32 | bootstrap inference 16 | mean SNPs/quartet: 78777 \n" ] } ], "source": [ "tet.run(auto=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Show the full tree with bootstrap supports" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
FLSF33SCCU3FLBA140TXWV2LALC2FLSA185FLCK216FLMO62FLSF54FLAB109FLWO6FLCK18FLSF47HNDA09CRL0030BZBB1MXSA3017CUVN10CRL0001CUCA4CUSV6CUMM5BJVL19BJSL25BJSB3MXED8MXGT4TXMD3TXGR3ENARDODUreferenceCHHENI1810018251003737754343621001001006837100100100128110010081100100100100100100100100100100100100
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tre = toytree.tree(tet.trees.tree).root([\"HE\", \"NI\"])\n", "tre.draw(node_labels=\"support\", use_edge_lengths=False);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Show the majority-rule consensus tree with bootstrap supports" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
FLSF33LALC2FLBA140SCCU3TXWV2FLCK216FLMO62FLSA185FLWO6FLSF54FLAB109FLCK18FLSF47CRL0030HNDA09BZBB1MXSA3017CRL0001CUVN10CUCA4CUSV6CUMM5BJVL19BJSL25BJSB3MXED8MXGT4TXGR3TXMD3ENARDODUreferenceCHNIHE3825381003862751005662100100100563810010010069318110010081100100100100100100100100100100100100
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tre = toytree.tree(tet.trees.cons).root([\"HE\", \"NI\"])\n", "tre.draw(node_labels=\"support\", use_edge_lengths=False);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Show variation over the bootstrap replicates" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
FLBA140LALC2FLSF33SCCU3TXWV2FLMO62FLCK216FLSA185FLWO6FLCK18FLSF54FLAB109FLSF47HNDA09CRL0030BZBB1MXSA3017CRL0001CUVN10CUSV6CUCA4CUMM5BJVL19BJSB3BJSL25MXED8MXGT4TXGR3TXMD3DODUENARreferenceCHHENI
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mtre = toytree.mtree(tet.trees.boots)\n", "mtre.treelist = [i.root([\"HE\", \"NI\"]) for i in mtre.treelist]\n", "mtre.draw_cloud_tree(\n", " height=600, \n", " width=400,\n", " use_edge_lengths=False, \n", " html=True,\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Continuing from a checkpoint\n", "If you want to run more bootstrap replicates you can simply reinit an analysis object with the same `name` and set the number of bootstrap replicates to a higher value and call `.run()` again. If you try calling `.run()` again and you have already completed all of the requested results then it will simply print that it is finished. If you want it to overwrite existing results with the same name then you can use the `force` arg in run. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "16 bootstrap result trees already exist for virentes-min4.\n" ] } ], "source": [ "# analysis is finished so it will not run\n", "tet.run()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here I set the number of requested bootstrap replicates to 20 and call `.run()` again. You can see that the analysis continues from 17, since we already completed 16 bootstrap replicates earlier, and will go until it completes 20 bootstraps. " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parallel connection | d9118d19223a: 40 cores\n", "[####################] 100% 0:01:49 | bootstrap inference 17 | mean SNPs/quartet: 78350 \n", "[####################] 100% 0:01:34 | bootstrap inference 18 | mean SNPs/quartet: 77653 \n", "[####################] 100% 0:01:32 | bootstrap inference 19 | mean SNPs/quartet: 78560 \n", "[####################] 100% 0:01:32 | bootstrap inference 20 | mean SNPs/quartet: 78380 \n" ] } ], "source": [ "# increase nboots and continue from existing analysis object\n", "tet.params.nboots = 20\n", "tet.run(auto=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, maybe you are returning to this analysis after a while and decide you want to do more bootstraps. You can re-load the analysis object by entering the same `name` and `working_dir` as in the original analysis, and adding the `load=True` argument. I set the number of bootstraps to 25 now. This will load the results from before and add new results when you call `.run()`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# # re-init analysis object (will load existing results at this name)\n", "# tet = ipa.tetrad(\n", "# name=\"virentes-min4\",\n", "# data=data,\n", "# nquartets=1e6,\n", "# nboots=25,\n", "# load=True,\n", "# )\n", "# tet.run(auto=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }