{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Canarium GBS: *structure* analyses\n", "### *Federman et al.*\n", "\n", "This notebook provides all code necessary to reproduce the structure analyses. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## conda install ipyrad -c ipyrad\n", "## conda install toytree -c eaton-lab\n", "## conda install structure -c ipyrad\n", "## conda install clumpp -c ipyrad" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ipyrad v.0.7.22\n" ] } ], "source": [ "import toytree\n", "import toyplot.svg\n", "import numpy as np\n", "import ipyrad as ip\n", "import ipyrad.analysis as ipa\n", "print \"ipyrad v.{}\".format(ip.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connect to cluster" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "host compute node: [40 cores] on sacra\n" ] } ], "source": [ "import ipyparallel as ipp\n", "ipyclient = ipp.Client()\n", "ip.cluster_info(ipyclient)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Population structure " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## define k vals to test\n", "tests = [2, 3, 4, 5, 6, 7, 8, 9, 10]\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## init structure analysis object\n", "s = ipa.structure(\n", " name=\"Canarium-min30-nout\", \n", " data=\"analysis-ipyrad/Canarium-min30-nout_outfiles/Canarium-min30-nout.str\",\n", " mapfile=\"analysis-ipyrad/Canarium-min30-nout_outfiles/Canarium-min30-nout.snps.map\",\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## set mainparams for object\n", "s.mainparams.burnin = 50000\n", "s.mainparams.numreps = 100000" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "submitted 20 structure jobs [Canarium-min30-nout-K-2]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-3]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-4]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-5]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-6]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-7]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-8]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-9]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-10]\n" ] } ], "source": [ "## submit batches of 20 replicate jobs for each value of K \n", "for kpop in tests:\n", " s.run(\n", " kpop=kpop, \n", " nreps=20, \n", " seed=12345,\n", " ipyclient=ipyclient,\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run some jobs longer\n", "For the high K values it seems that some of them had few or no reps that converged, so we ran an additional 20 reps for each K from 6-10 for a longer number of generations to get better convergence." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## set longer run times for additional reps on large Ks\n", "s.mainparams.burnin = 100000\n", "s.mainparams.numreps = 500000\n", "moretests = [6, 7, 8, 9, 10]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "submitted 20 structure jobs [Canarium-min30-nout-K-6]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-7]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-8]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-9]\n", "submitted 20 structure jobs [Canarium-min30-nout-K-10]\n" ] } ], "source": [ "## submit batches of 20 replicate jobs for each high value of K \n", "for kpop in moretests:\n", " s.run(\n", " kpop=kpop, \n", " nreps=20, \n", " seed=12345,\n", " ipyclient=ipyclient,\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summarize results\n", "\n", "### Get Evanno table for best fitting K\n", "Below are Evanno tables calculated for several different values of the filtering parameter `max_var_multiple`, which is used to exclude replicates that have a variance of LnLik N times greater than the rep with minimum variance for that value of K. We find that with `max_var_multiple=100` we get approximately 20 reps in every test that passes filtering. The goal here is to exclude reps that may not have converged and therefore would otherwise lower the score for that value of K. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NrepsdeltaKestLnProbMeanestLnProbStdevlnPKlnPPK
2200.000-1.537e+052.406e+030.000e+000.000e+00
3202.029-1.373e+051.043e+041.638e+042.116e+04
4201235.503-1.421e+051.556e+03-4.773e+031.923e+06
5200.847-2.069e+063.035e+06-1.927e+062.569e+06
6390.674-1.427e+063.116e+066.420e+052.099e+06
7390.284-2.884e+065.253e+06-1.457e+061.493e+06
8390.143-2.848e+065.440e+063.654e+047.805e+05
9390.308-3.592e+065.755e+06-7.440e+051.775e+06
10390.000-2.561e+063.312e+061.031e+060.000e+00
\n", "
" ], "text/plain": [ " Nreps deltaK estLnProbMean estLnProbStdev lnPK lnPPK\n", "2 20 0.000 -1.537e+05 2.406e+03 0.000e+00 0.000e+00\n", "3 20 2.029 -1.373e+05 1.043e+04 1.638e+04 2.116e+04\n", "4 20 1235.503 -1.421e+05 1.556e+03 -4.773e+03 1.923e+06\n", "5 20 0.847 -2.069e+06 3.035e+06 -1.927e+06 2.569e+06\n", "6 39 0.674 -1.427e+06 3.116e+06 6.420e+05 2.099e+06\n", "7 39 0.284 -2.884e+06 5.253e+06 -1.457e+06 1.493e+06\n", "8 39 0.143 -2.848e+06 5.440e+06 3.654e+04 7.805e+05\n", "9 39 0.308 -3.592e+06 5.755e+06 -7.440e+05 1.775e+06\n", "10 39 0.000 -2.561e+06 3.312e+06 1.031e+06 0.000e+00" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.get_evanno_table(tests, max_var_multiple=0, quiet=True)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NrepsdeltaKestLnProbMeanestLnProbStdevlnPKlnPPK
2200.000-153726.1852405.5080.0000.000
3202.029-137343.78510428.47316382.40021155.730
4204.618-142117.1151556.118-4773.3307186.853
5130.792-139703.5922555.6622413.5232025.005
6241.780-139315.0757703.825388.51713710.965
7130.581-152637.52336075.618-13322.44820948.565
8170.646-145011.40632030.1657626.11720693.599
980.218-158078.88835401.179-13067.4827730.769
1040.000-163415.60031602.319-5336.7120.000
\n", "
" ], "text/plain": [ " Nreps deltaK estLnProbMean estLnProbStdev lnPK lnPPK\n", "2 20 0.000 -153726.185 2405.508 0.000 0.000\n", "3 20 2.029 -137343.785 10428.473 16382.400 21155.730\n", "4 20 4.618 -142117.115 1556.118 -4773.330 7186.853\n", "5 13 0.792 -139703.592 2555.662 2413.523 2025.005\n", "6 24 1.780 -139315.075 7703.825 388.517 13710.965\n", "7 13 0.581 -152637.523 36075.618 -13322.448 20948.565\n", "8 17 0.646 -145011.406 32030.165 7626.117 20693.599\n", "9 8 0.218 -158078.888 35401.179 -13067.482 7730.769\n", "10 4 0.000 -163415.600 31602.319 -5336.712 0.000" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.get_evanno_table(tests, max_var_multiple=10, quiet=True)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NrepsdeltaKestLnProbMeanestLnProbStdevlnPKlnPPK
2200.000-153726.1852405.5080.0000.000
3202.029-137343.78510428.47316382.40021155.730
4204.618-142117.1151556.118-4773.3307186.853
5139.085-139703.5922555.6622413.52323218.227
6260.645-160508.29682413.941-20804.70453179.488
7170.594-234492.488168026.173-73984.19299840.352
8211.320-208636.329152555.06525856.160201418.181
9200.423-384198.350230926.987-175562.02197648.791
10200.000-462111.580221353.039-77913.2300.000
\n", "
" ], "text/plain": [ " Nreps deltaK estLnProbMean estLnProbStdev lnPK lnPPK\n", "2 20 0.000 -153726.185 2405.508 0.000 0.000\n", "3 20 2.029 -137343.785 10428.473 16382.400 21155.730\n", "4 20 4.618 -142117.115 1556.118 -4773.330 7186.853\n", "5 13 9.085 -139703.592 2555.662 2413.523 23218.227\n", "6 26 0.645 -160508.296 82413.941 -20804.704 53179.488\n", "7 17 0.594 -234492.488 168026.173 -73984.192 99840.352\n", "8 21 1.320 -208636.329 152555.065 25856.160 201418.181\n", "9 20 0.423 -384198.350 230926.987 -175562.021 97648.791\n", "10 20 0.000 -462111.580 221353.039 -77913.230 0.000" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.get_evanno_table(tests, max_var_multiple=50, quiet=True)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NrepsdeltaKestLnProbMeanestLnProbStdevlnPKlnPPK
2200.000-153726.1852405.5080.0000.000
3202.029-137343.78510428.47316382.40021155.730
4204.618-142117.1151556.118-4773.3307186.853
51340.533-139703.5922555.6622413.523103589.589
6290.894-240879.659254325.896-101176.066227331.040
7261.045-569386.765506552.755-328507.107529423.850
8271.071-368470.022342168.594200916.743366448.612
9230.343-534001.891450635.181-165531.869154559.665
10220.000-544974.095341123.401-10972.2040.000
\n", "
" ], "text/plain": [ " Nreps deltaK estLnProbMean estLnProbStdev lnPK lnPPK\n", "2 20 0.000 -153726.185 2405.508 0.000 0.000\n", "3 20 2.029 -137343.785 10428.473 16382.400 21155.730\n", "4 20 4.618 -142117.115 1556.118 -4773.330 7186.853\n", "5 13 40.533 -139703.592 2555.662 2413.523 103589.589\n", "6 29 0.894 -240879.659 254325.896 -101176.066 227331.040\n", "7 26 1.045 -569386.765 506552.755 -328507.107 529423.850\n", "8 27 1.071 -368470.022 342168.594 200916.743 366448.612\n", "9 23 0.343 -534001.891 450635.181 -165531.869 154559.665\n", "10 22 0.000 -544974.095 341123.401 -10972.204 0.000" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.get_evanno_table(tests, max_var_multiple=100, quiet=True)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NrepsdeltaKestLnProbMeanestLnProbStdevlnPKlnPPK
2200.000-1.537e+052.406e+030.000e+000.000e+00
3202.029-1.373e+051.043e+041.638e+042.116e+04
420983.330-1.421e+051.556e+03-4.773e+031.530e+06
5190.967-1.677e+062.544e+06-1.535e+062.461e+06
6372.031-7.510e+051.030e+069.261e+052.093e+06
7370.413-1.918e+062.972e+06-1.167e+061.227e+06
8370.250-1.858e+063.142e+065.974e+047.863e+05
9370.199-2.585e+063.780e+06-7.265e+057.504e+05
10390.000-2.561e+063.312e+062.385e+040.000e+00
\n", "
" ], "text/plain": [ " Nreps deltaK estLnProbMean estLnProbStdev lnPK lnPPK\n", "2 20 0.000 -1.537e+05 2.406e+03 0.000e+00 0.000e+00\n", "3 20 2.029 -1.373e+05 1.043e+04 1.638e+04 2.116e+04\n", "4 20 983.330 -1.421e+05 1.556e+03 -4.773e+03 1.530e+06\n", "5 19 0.967 -1.677e+06 2.544e+06 -1.535e+06 2.461e+06\n", "6 37 2.031 -7.510e+05 1.030e+06 9.261e+05 2.093e+06\n", "7 37 0.413 -1.918e+06 2.972e+06 -1.167e+06 1.227e+06\n", "8 37 0.250 -1.858e+06 3.142e+06 5.974e+04 7.863e+05\n", "9 37 0.199 -2.585e+06 3.780e+06 -7.265e+05 7.504e+05\n", "10 39 0.000 -2.561e+06 3.312e+06 2.385e+04 0.000e+00" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.get_evanno_table(tests, max_var_multiple=1000, quiet=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get permuted reps with CLUMPP\n", "We calculate a permuted table of results across replicate runs for each value of K while excluding reps based on the max_var_multiple parameter (see above for description). " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## summarize results\n", "s.clumppparams.m = 3 ## use largegreedy algorithm\n", "s.clumppparams.greedy_option = 2 ## test nrepeat possible orders\n", "s.clumppparams.repeats = 100000 ## number of repeats" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[K2] 20/20 results permuted across replicates (max_var=100.0).\n", "[K3] 20/20 results permuted across replicates (max_var=100.0).\n", "[K4] 20/20 results permuted across replicates (max_var=100.0).\n", "[K5] 13/20 results permuted across replicates (max_var=100.0).\n", "[K6] 29/39 results permuted across replicates (max_var=100.0).\n", "[K7] 26/39 results permuted across replicates (max_var=100.0).\n", "[K8] 27/39 results permuted across replicates (max_var=100.0).\n", "[K9] 23/39 results permuted across replicates (max_var=100.0).\n", "[K10] 22/39 results permuted across replicates (max_var=100.0).\n" ] } ], "source": [ "qtable = s.get_clumpp_table(tests, max_var_multiple=100.)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[K2] 20/20 results permuted across replicates (max_var=100.0).\n", "[K3] 20/20 results permuted across replicates (max_var=100.0).\n", "[K4] 20/20 results permuted across replicates (max_var=100.0).\n", "[K5] 13/20 results permuted across replicates (max_var=100.0).\n", "[K6] 29/39 results permuted across replicates (max_var=100.0).\n", "[K7] 26/39 results permuted across replicates (max_var=100.0).\n", "[K8] 27/39 results permuted across replicates (max_var=100.0).\n", "[K9] 23/39 results permuted across replicates (max_var=100.0).\n", "[K10] 22/39 results permuted across replicates (max_var=100.0).\n" ] } ], "source": [ "qtable = s.get_clumpp_table(tests, max_var_multiple=100.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Barplots of population structure" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## load a tree topology for ordering tips on barplot\n", "tre = toytree.tree(\"./analysis-raxml/RAxML_bipartitions.Canarium-min20\")\n", "\n", "## root on outgroups and then drop outgroups from the tree\n", "outs = [\"D14269\", \"D13374\", \"D13852\", \"SFC1988\"]\n", "rtre = tre.root(outs).drop_tips(outs)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for kpop in tests:\n", " c = toyplot.Canvas(width=550, height=110)\n", " a = c.cartesian(margin=25) \n", " minpos = np.linspace(0, 38, 38)\n", " maxpos = np.linspace(0.9, 38.9, 38)\n", " m = a.bars(minpos, maxpos,\n", " qtable[kpop].loc[rtre.get_tip_labels()[::-1],],\n", " color = toyplot.color.brewer.map(\"Set3\"),\n", " style={\"stroke-width\": 0.5},\n", " )\n", " a.show = False\n", " \n", " ## save figure\n", " import toyplot.svg\n", " toyplot.svg.render(c, \"figures/structure-k{}.svg\".format(kpop)) " ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import toyplot\n", "import numpy as np\n", "\n", "## create canvas\n", "c = toyplot.Canvas(width=600, height=160)\n", "a = c.cartesian()\n", "\n", "## order of colors\n", "order = [2, 7, 0, 5, 4, 1, 6, 3][::-1]\n", "\n", "## space between bars\n", "minpos = np.linspace(0, 38, 38)\n", "maxpos = np.linspace(0.9, 38.9, 38)\n", "\n", "## plot bars\n", "m = a.bars(minpos, maxpos, \n", " qtable[8].loc[rtre.get_tip_labels()[::-1], order],\n", " color = toyplot.color.brewer.map(\"Set3\"),\n", " style={\"stroke-width\": 0})\n", "a.show = False\n", "\n", "## save figure\n", "import toyplot.svg\n", "toyplot.svg.render(c, \"figures/structure-final.svg\")\n", "c" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }