{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NB3: Admixture: Structure analysis\n", "\n", "The data sets used in this notebook were generated with ipyrad (see [notebook here]()). You can re-create the data sets used here by running that notebook. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Table of contents\n", "[Software installation (conda)](#Required-software) \n", "[Parallelized analysis (structure)](#Analysis-structure) \n", "[Ordered barplots (toyplot)](#Plots)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Required software\n", "All required software can be installed locally using *conda*. I assume here that you already have `ipyrad` installed using conda. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## conda install toytree -c eaton-lab\n", "## conda install ipyrad -c ipyrad \n", "## conda install structure -c ipyrad\n", "## conda install clumpp -c ipyrad" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ipyrad v.0.6.20\n" ] } ], "source": [ "## import packages\n", "import ipyrad as ip\n", "import ipyrad.analysis as ipa\n", "import numpy as np\n", "import toyplot\n", "import toytree\n", "import glob\n", "\n", "## print ipyrad info\n", "print \"ipyrad v.{}\".format(ip.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cluster setup\n", "See more information on ipyparallel setup here. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "host compute node: [40 cores] on tinus\n" ] } ], "source": [ "## connect to ipyparallel cluster and print information\n", "import ipyparallel as ipp\n", "ipyclient = ipp.Client()\n", "print ip.cluster_info(ipyclient)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A function to select samples from clades\n", "For a given data set this function returns the samples in it that are from the selected clade. " ] }, { "cell_type": "code", "execution_count": 158, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_subsample_names(data, clade):\n", " ## known clades\n", " c1 = [\"tonduzii\", \"maxima\", \"yoponens\", \"glabrata\", \"insipida\"]\n", " c2 = [\"nymph\", \"obtus\", \"pope\", \"bull\", \"citri\", \"paraen\",\n", " \"pertus\", \"perfor\", \"dugan\", \"turbin\", \"colub\", \n", " \"costa\", \"tria\", \"trig\"]\n", " \n", " ## select clades from a dict\n", " clades={\n", " \"pharmacosycea\": c1,\n", " \"americana\": c2,\n", " }\n", " \n", " ## return selected clade names\n", " keys = data.samples.keys()\n", " names = [i for i in keys if any([bit in i for bit in clades[clade]])]\n", " return names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis `STRUCTURE`\n", "(Bayesian clustering analysis)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create new assemblies (data files) for subclades\n", "Here we will load an assembly and the use the subsamples argument to `.branch()` to select a subset of samples to include in the new branch. Then we recreate the final output files for the assembly with those subsamples by running step7 of the ipyrad assembly. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## parent clade to make branches from (has param setting minsamp=20)\n", "parent = ip.load_json(\"ficus_dhi_s20\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Assembly: pharma-clade-min20\n", " [####################] 100% filtering loci | 0:00:26 | s7 | \n", " [####################] 100% building loci/stats | 0:00:31 | s7 | \n", " [####################] 100% building alleles | 0:00:35 | s7 | \n", " [####################] 100% building vcf file | 0:00:41 | s7 | \n", " [####################] 100% writing vcf file | 0:00:00 | s7 | \n", " [####################] 100% building arrays | 0:00:13 | s7 | \n", " [####################] 100% writing outfiles | 0:00:03 | s7 | \n", " Outfiles written to: ~/Documents/Ficus/analysis-ipyrad/pharma-clade-min20_outfiles\n", "\n", " Assembly: america-clade-min30\n", " [####################] 100% filtering loci | 0:00:31 | s7 | \n", " [####################] 100% building vcf file | 0:00:43 | s7 | " ] } ], "source": [ "## create subsampled pharma-clade branch\n", "subs = get_subsample_names(parent, 'pharmacosycea')\n", "clade1 = parent.branch(\"pharma-clade-min20\", subsamples=subs)\n", "clade1.set_params(\"min_samples_locus\", 20)\n", "\n", "## create subsampled urostigma clade branch\n", "subs = get_subsample_names(parent, 'americana')\n", "clade2 = parent.branch(\"america-clade-min30\", subsamples=subs)\n", "clade2.set_params(\"min_samples_locus\", 30)\n", "\n", "## assemble data sets\n", "clade1.run(\"7\", force=True)\n", "clade2.run(\"7\", force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create structure objects\n", "By providing a mapfile each rep that we run will randomly sample a single unlinked SNP from each RAD locus. So each rep has a slightly different data set. " ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = ip.load_json(\"./analysis-ipyrad/pharma-clade-min20.json\", 1)\n", "pstruct = ipa.structure(\n", " name=data.name, \n", " workdir=\"./analysis-structure\",\n", " strfile=data.outfiles.str, \n", " mapfile=data.outfiles.snpsmap,\n", ")\n", "\n", "data = ip.load_json(\"./analysis-ipyrad/america-clade-min30.json\", 1)\n", "astruct = ipa.structure(\n", " name=data.name, \n", " workdir=\"./analysis-structure/\",\n", " strfile=data.outfiles.str, \n", " mapfile=data.outfiles.snpsmap, \n", ")" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## set params for structure jobs\n", "pstruct.mainparams.burnin = 20000 \n", "pstruct.mainparams.numreps = 100000 \n", "astruct.mainparams.burnin = 20000\n", "astruct.mainparams.numreps = 100000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit jobs to run in parallel" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "submitted 5 structure jobs [pharma-clade-min20-K-2]\n", "submitted 5 structure jobs [pharma-clade-min20-K-3]\n", "submitted 5 structure jobs [pharma-clade-min20-K-4]\n", "submitted 5 structure jobs [pharma-clade-min20-K-5]\n" ] } ], "source": [ "## submit 'pharma' jobs for several values of K\n", "for kpop in [2, 3, 4, 5, 6]:\n", " pstruct.submit_structure_jobs(\n", " kpop=kpop, \n", " nreps=5, \n", " seed=12345, \n", " ipyclient=ipyclient)" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "submitted 10 structure jobs [pharma-clade-min20-K-5]\n", "submitted 10 structure jobs [pharma-clade-min20-K-6]\n" ] } ], "source": [ "for kpop in [5,6]:\n", " pstruct.submit_structure_jobs(\n", " kpop=kpop, \n", " nreps=10, \n", " seed=12345, \n", " ipyclient=ipyclient)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for kpop in [6, 8, 10, 12, 14]:\n", " astruct.submit_structure_jobs(\n", " kpop=kpop,\n", " nreps=6, \n", " seed=54321,\n", " ipyclient=ipyclient)" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "submitted 5 structure jobs [america-clade-min30-K-15]\n", "submitted 5 structure jobs [america-clade-min30-K-16]\n" ] } ], "source": [ "for kpop in [15, 16]:\n", " astruct.submit_structure_jobs(\n", " kpop=kpop,\n", " nreps=5, \n", " seed=54321,\n", " ipyclient=ipyclient)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## wait for jobs to finish\n", "ipyclient.wait()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Collect results and summarize w/ clumpp" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [], "source": [ "pstruct.clumppparams\n", "pstruct.clumppparams.m = 3\n", "pstruct.clumppparams.greedy_option = 2\n", "astruct.clumppparams\n", "astruct.clumppparams.m = 3\n", "astruct.clumppparams.greedy_option = 2\n", "astruct.clumppparams.repeats = 50000" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "mean scores across 5 replicates.\n", "mean scores across 5 replicates.\n", "mean scores across 5 replicates.\n", "mean scores across 5 replicates.\n" ] } ], "source": [ "ptables = {}\n", "for kpop in [2, 3, 4, 5]:\n", " ptables[kpop] = pstruct.get_clumpp_table(kpop)" ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "mean scores across 10 replicates.\n", "mean scores across 10 replicates.\n" ] } ], "source": [ "ptables = {}\n", "for kpop in [2, 3, 4, 5, 6]:\n", " ptables[kpop] = pstruct.get_clumpp_table(kpop)" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "mean scores across 6 replicates.\n", "mean scores across 6 replicates.\n", "mean scores across 6 replicates.\n", "mean scores across 6 replicates.\n", "mean scores across 6 replicates.\n" ] } ], "source": [ "atables = {}\n", "for kpop in [6, 8, 10, 12, 14]:\n", " atables[kpop] = astruct.get_clumpp_table(kpop) " ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "mean scores across 5 replicates.\n" ] } ], "source": [ "atables[15] = astruct.get_clumpp_table(15) " ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "mean scores across 3 replicates.\n" ] } ], "source": [ "atables[16] = astruct.get_clumpp_table(16) " ] }, { "cell_type": "code", "execution_count": 150, "metadata": { "collapsed": true }, "outputs": [], "source": [ "atables[15] = t15\n", "atables[16] = t16" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot the results" ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [], "source": [ "## ordered alphabetically for now, replace with this tree-order later\n", "ptre = toytree.tree(\"analysis-raxml/RAxML_bestTree.ficus_dhi_s10\")\n", "ptre.tree.prune(get_subsample_names(parent, \"pharmacosycea\"))\n", "porder = ptre.get_tip_labels()\n", "\n", "atre = toytree.tree(\"analysis-raxml/RAxML_bestTree.ficus_dhi_s10\")\n", "atre.tree.prune(get_subsample_names(parent, \"americana\"))\n", "aorder = atre.get_tip_labels()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build a color palette" ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "