{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NB3: Phylogeny: species tree \n", "\n", "The data sets used in this notebook were generated with ipyrad (see [notebook here]()). You can re-create the data sets used here by running that notebook. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Table of contents\n", "[Software installation (conda)](#Required-software) \n", "[Phylogenetic analysis (tetrad)](#Analysis-tetrad) \n", "[Tree plots (toytree)](#Plot-trees)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Required software\n", "All software required for this notebook can be installed locally using *conda*. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## conda install toytree -c eaton-lab\n", "## conda install ipyrad -c ipyrad " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ipyrad v.0.6.20\n" ] } ], "source": [ "## import packages\n", "import ipyrad as ip\n", "import ipyrad.analysis as ipa\n", "import toytree\n", "\n", "## print ipyrad info\n", "print(\"ipyrad v.{}\".format(ip.__version__))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cluster setup\n", "We will distribute jobs across an HPC cluster using the ipyparallel library (which is installed as a dependency of ipyrad). Start an ipcluster instance and check your connection below. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "## print ipyparallel cluster information\n", "import ipyparallel as ipp\n", "ipyclient = ipp.Client()\n", "#print ip.cluster_info(ipyclient)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A function to select samples from clades\n", "Given an assembly, this function returns the names of the samples that belong to the selected clade, matched by substrings of the sample names. 
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_subsample_names(data, clade):\n", " ## known clades\n", " c1 = [\"tonduzii\", \"maxima\", \"yoponens\", \"glabrata\", \"insipida\"]\n", " c2 = [\"nymph\", \"obtus\", \"pope\", \"bull\", \"citri\", \"paraen\",\n", " \"pertus\", \"perfor\", \"dugan\", \"turbin\", \"colub\", \n", " \"costa\", \"tria\", \"trig\"]\n", " \n", " ## select clades from a dict\n", " clades={\n", " \"pharmacosycea\": c1,\n", " \"americana\": c2,\n", " }\n", " \n", " ## return selected clade names\n", " keys = data.samples.keys()\n", " names = [i for i in keys if any([bit in i for bit in clades[clade]])]\n", " return names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data sets\n", "\n", "We will use three data sets that were already created: one \"full\" data set and two subsampled data sets. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " loading Assembly: ficus_dhi_s4\n", " from saved path: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_s4.json\n", "\n", " Assembly: pharma_dhi_s4\n", " [####################] 100% filtering loci | 0:00:17 | s7 | \n", " [####################] 100% building loci/stats | 0:00:35 | s7 | \n", " [####################] 100% building alleles | 0:00:40 | s7 | \n", " [####################] 100% building vcf file | 0:00:51 | s7 | \n", " [####################] 100% writing vcf file | 0:00:00 | s7 | \n", " [####################] 100% building arrays | 0:00:14 | s7 | \n", " [####################] 100% writing outfiles | 0:00:14 | s7 | \n", " Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/pharma_dhi_s4_outfiles\n", "\n", " Assembly: america_dhi_s4\n", " [####################] 100% filtering loci | 0:00:36 | s7 | \n", " [####################] 100% building loci/stats | 0:00:37 | s7 | \n", " [####################] 100% building alleles | 
0:00:45 | s7 | \n", " [####################] 100% building vcf file | 0:01:16 | s7 | \n", " [####################] 100% writing vcf file | 0:00:00 | s7 | \n", " [####################] 100% building arrays | 0:00:29 | s7 | \n", " [####################] 100% writing outfiles | 0:00:44 | s7 | \n", " Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/america_dhi_s4_outfiles\n" ] } ], "source": [ "## the full large data set\n", "full = ip.load_json(\"analysis-ipyrad/ficus_dhi_s4.json\")\n", "\n", "## pharma subsampled data set\n", "pharma = full.branch(\"pharma_dhi_s4\", \n", " subsamples=get_subsample_names(full, \"pharmacosycea\"))\n", "pharma.set_params(\"min_samples_locus\", 4)\n", "pharma.run(\"7\", force=True)\n", "\n", "## americana subsampled data set\n", "america = full.branch(\"america_dhi_s4\", \n", " subsamples=get_subsample_names(full, \"americana\"))\n", "america.set_params(\"min_samples_locus\", 4)\n", "america.run(\"7\", force=True) " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " loading Assembly: ficus_dhi_s4\n", " from saved path: ~/Documents/Ficus/analysis-ipyrad/ficus_dhi_s4.json\n", " loading Assembly: pharma_dhi_s4\n", " from saved path: ~/Documents/Ficus/analysis-ipyrad/pharma_dhi_s4.json\n", " loading Assembly: america_dhi_s4\n", " from saved path: ~/Documents/Ficus/analysis-ipyrad/america_dhi_s4.json\n" ] } ], "source": [ "full = ip.load_json(\"analysis-ipyrad/ficus_dhi_s4.json\")\n", "pharma = ip.load_json(\"analysis-ipyrad/pharma_dhi_s4.json\")\n", "america = ip.load_json(\"analysis-ipyrad/america_dhi_s4.json\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis `tetrad`\n", "(Inference by phylogenetic invariants within quartets)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading seq array [76 taxa x 547987 bp]\n", "max unlinked SNPs 
per quartet (nloci): 76904\n", "loading seq array [23 taxa x 143884 bp]\n", "max unlinked SNPs per quartet (nloci): 41432\n", "loading seq array [53 taxa x 361100 bp]\n", "max unlinked SNPs per quartet (nloci): 56067\n" ] } ], "source": [ "## create tetrad object\n", "fulltet = ipa.tetrad(\n", " full.name,\n", " seqfile=full.outfiles.snpsphy, \n", " mapfile=full.outfiles.snpsmap,\n", " workdir=\"analysis-tetrad\", \n", " nboots=100,\n", " );\n", "\n", "pharmatet = ipa.tetrad(\n", " pharma.name,\n", " seqfile=pharma.outfiles.snpsphy, \n", " mapfile=pharma.outfiles.snpsmap,\n", " workdir=\"analysis-tetrad\", \n", " nboots=100,\n", " );\n", "\n", "americatet = ipa.tetrad(\n", " america.name,\n", " seqfile=america.outfiles.snpsphy, \n", " mapfile=america.outfiles.snpsmap,\n", " workdir=\"analysis-tetrad\", \n", " nboots=100,\n", " );\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pharmatet.run(force=True, ipyclient=ipyclient)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "americatet.run(ipyclient=ipyclient)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "fulltet.run(ipyclient=ipyclient)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot trees" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import toytree\n", "import toyplot\n", "import toyplot.pdf" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "