{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ipyrad-analysis toolkit: tetrad\n", "\n", "`tetrad` is a species tree inference tool based on the SVDQuartets algorithm of Chifman and Kubatko. It uses the theory of phylogenetic invariants to resolve quartet trees from SNPs for all sets of quartets in a larger tree, and then joins the quartets together into a supertree using QMC. Here I demonstrate how to call tetrad from the ipyrad-analysis tools, which is convenient for keeping all of your analyses in jupyter notebooks. Alternatively, you could also call tetrad as a command-line tool (see the [tetrad docs](https://github.com/eaton-lab/tetrad)). \n", "\n", "The following features of `tetrad` are highlighted:\n", "\n", "**Parallelization**: tetrad can be massively parallelized on an HPC cluster. It approximately scales linearly with the number of available cores. Using the flexible ipyparallel backend, you can even parallelize over multiple nodes using MPI ([more details coming soon]). \n", "\n", "**Bootstrap sampling**: In contrast to SVDquartets, tetrad is designed particularly to work well with RAD-seq data, though it can apply to any SNP data set. Bootstrap replicates resample the number of loci with replacement to the same size as the original data set. \n", "\n", "**SNP subsampling**: The underlying model assumes that the examined SNPs are unlinked, and tetrad uses a very efficient method to maximize the number of unlinked SNPs used in the analysis. A very crude way to achieve unlinked SNPs would be to sample one SNP per locus before beginning the analysis. However, a site that is variable in the context of all your samples is not necessarily informative for any given quartet of samples. Instead, tetrad subsamples a single SNP from every locus separately for every quartet set in the analysis, and repeats this independently in every bootstrap replicate. This way the maximum amount of unlinked SNP information is used in every quartet inference. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# conda install ipyrad -c bioconda\n", "# conda install tetrad -c eaton-lab -c conda-forge" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import toytree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Input data\n", "\n", "The input data file should be the `.snps.hdf5` file produced by ipyrad (v.0.9.13 or newer). If you did not assemble your data in ipyrad then you can convert your SNPs data to this format from any VCF using the [vcf_to_hdf5 tool](https://ipyrad.readthedocs.io/en/latest/API-analysis/cookbook-vcf2hdf5.html). " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# the path to your sequence data in HDF5 format\n", "data = \"/home/deren/Documents/virentes-reference/analysis-ipyrad/ref_min4_outfiles/ref_min4.snps.hdf5\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize the analysis object\n", "Here you can enter parameters for the analysis. By default only a subsample of the total quartets will be inferred. To instead infer all quartets simply enter a very large number for the `nquartets` parameter and it will use the maximum. When you initialize the object it will print the size of your dataset, the number of loci, and the number of quartets. The number of loci is of interest because this is the maximum number of SNPs that will be used in an analysis. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading snps array [37 taxa x 1182005 snps]\n", "max unlinked SNPs per quartet [nloci]: 88938\n", "quartet sampler [full]: 66045 / 66045\n" ] } ], "source": [ "# init analysis object with input data and (optional) parameter options\n", "tet = ipa.tetrad(\n", " name=\"virentes-min4\",\n", " data=data,\n", " nquartets=1e6,\n", " nboots=16,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Call run" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parallel connection | d9118d19223a: 40 cores\n", "initializing quartet sets database\n", "[####################] 100% 0:01:48 | inferring full tree * | mean SNPs/quartet: 78381 \n", "[####################] 100% 0:01:32 | bootstrap inference 1 | mean SNPs/quartet: 77664 \n", "[####################] 100% 0:01:33 | bootstrap inference 2 | mean SNPs/quartet: 78121 \n", "[####################] 100% 0:01:33 | bootstrap inference 3 | mean SNPs/quartet: 78356 \n", "[####################] 100% 0:01:33 | bootstrap inference 4 | mean SNPs/quartet: 78399 \n", "[####################] 100% 0:01:32 | bootstrap inference 5 | mean SNPs/quartet: 77797 \n", "[####################] 100% 0:01:33 | bootstrap inference 6 | mean SNPs/quartet: 78836 \n", "[####################] 100% 0:01:32 | bootstrap inference 7 | mean SNPs/quartet: 78550 \n", "[####################] 100% 0:01:32 | bootstrap inference 8 | mean SNPs/quartet: 77943 \n", "[####################] 100% 0:01:32 | bootstrap inference 9 | mean SNPs/quartet: 77753 \n", "[####################] 100% 0:01:33 | bootstrap inference 10 | mean SNPs/quartet: 78646 \n", "[####################] 100% 0:01:33 | bootstrap inference 11 | mean SNPs/quartet: 78200 \n", "[####################] 100% 0:01:32 | bootstrap inference 12 | mean SNPs/quartet: 78073 \n", "[####################] 100% 0:01:32 | bootstrap inference 13 | mean SNPs/quartet: 78056 \n", "[####################] 100% 0:01:33 | bootstrap inference 14 | mean SNPs/quartet: 78491 \n", "[####################] 100% 0:01:32 | bootstrap inference 15 | mean SNPs/quartet: 78709 \n", "[####################] 100% 0:01:32 | bootstrap inference 16 | mean SNPs/quartet: 78777 \n" ] } ], "source": [ "tet.run(auto=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Show the full tree with bootstrap supports" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "