{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ipyrad-analysis toolkit: STRUCTURE\n", "\n", "Structure v.2.3.4 is a standard tool for examining population genetic structure based on allele frequencies within and among populations. Although many new implementations of the structure algorithm have been developed in recent years offering improvements to speed, the classic tool offers a number of useful options that keep it relevant to this day. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# conda install ipyrad -c conda-forge -c bioconda\n", "# conda install structure clumpp -c ipyrad\n", "# conda install toyplot -c conda-forge" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import toyplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required input data files\n", "Your input data should be a `.snps.hdf` database file produced by ipyrad. If you do not have this you can generate it from any VCF file following the [vcf2hdf5 tool tutorial](https://ipyrad.readthedocs.io/en/latest/API-analysis/cookbook-vcf2hdf5.html). The database file contains the genotype calls information as well as linkage information that is used for subsampling unlinked SNPs and bootstrap resampling." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# the path to your .snps.hdf5 database file\n", "SNPS = \"/tmp/oaks.snps.hdf5\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "file already exists\n" ] } ], "source": [ "# download example hdf5 dataset (158Mb, takes ~2-3 minutes)\n", "URL = \"https://www.dropbox.com/s/x6a4i47xqum27fo/virentes_ref.snps.hdf5?raw=1\"\n", "ipa.download(url=URL, path=SNPS);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Note: missing data in STRUCTURE analyses:\n", "\n", "Structure infers the values of missing data while it runs the MCMC chain. No imputation is required, but it will perform more accurately if there is less missing data and when base calls are more accurate. For RAD-seq data I generally recommend not imputing data and simply filtering fairly stringently (e.g., <10-20% missing). \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Approximate run times\n", "This example data set is fairly large in terms of the number of SNPs, but not that big in number of samples. Both factors affect run times. If you have many processors available you can parallelize runs so that many replicates across many K values can be run simultaneously. Still, expect run times to take many hours or even days." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Input data file and population assignments\n", "The IMAP population assignments are used to select which samples from the dataset to include in the analysis, and also can be used to apply filtering to require that data is distributed across all relevant groups, using minmap settings. For more information on IMAP see the general `ipa` tutorial." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# group individuals into populations\n", "IMAP = {\n", " \"virg\": [\"LALC2\", \"SCCU3\", \"FLSF33\", \"FLBA140\"],\n", " \"mini\": [\"FLSF47\", \"FLMO62\", \"FLSA185\", \"FLCK216\"],\n", " \"gemi\": [\"FLCK18\", \"FLSF54\", \"FLWO6\", \"FLAB109\"],\n", " \"sagr\": [\"CUVN10\", \"CUCA4\", \"CUSV6\"],\n", " \"oleo\": [\"CRL0030\", \"HNDA09\", \"BZBB1\", \"MXSA3017\"],\n", " \"fusi\": [\"MXED8\", \"MXGT4\", \"TXGR3\", \"TXMD3\"],\n", " \"bran\": [\"BJSL25\", \"BJSB3\", \"BJVL19\"],\n", "}\n", "\n", "# require that 50% of samples have data in each group\n", "MINMAP = {i: 0.5 for i in IMAP}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize a structure tool with data and params\n", "Here we init a structure analysis tool instance. The `name` and `workdir` settings tell it where to save the output files. The `imap` and `minmap` settings are used to subselect samples and filter for missing data based on groups. The `mincov` applies filters for missing data based on all samples. The `minmaf` filter removes sites with a minor allele frequency cutoff.\n", "\n", "When you init the object it will load the data and apply filtering. The printed output tells you how many SNPs were removed by each filter and the remaining number of missing data after filtering. By default the `ipa.structure` tool samples *a random set of unlinked SNPs for each replicate analysis*, which represents a single SNP from each RAD locus. Thus, the replicate analyses in ipa.structure represent variation over different random samples of SNPs." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "25 previous results loaded for run [test]\n", "Samples: 26\n", "Sites before filtering: 1182005\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 26588\n", "Filtered (mincov): 958870\n", "Filtered (minmap): 717351\n", "Filtered (subsample invariant): 592354\n", "Filtered (minor allele frequency): 56920\n", "Filtered (combined): 1809076\n", "Sites after filtering: 46962\n", "Sites containing missing values: 25241 (53.75%)\n", "Missing values in SNP matrix: 25241 (2.07%)\n", "SNPs (total): 46962\n", "SNPs (unlinked): 15887\n" ] } ], "source": [ "# init analysis object with input data and (optional) parameter options\n", "struct = ipa.structure(\n", " name=\"test\",\n", " workdir=\"analysis-structure\",\n", " data=SNPS,\n", " imap=IMAP,\n", " minmap=MINMAP,\n", " mincov=0.95,\n", " minmaf=0.05,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run STRUCTURE\n", "The `burnin` and `numreps` parameters determine the length of the run. You can access other optional parameters of structure from the mainparams attribute." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "struct.mainparams.burnin = 50000\n", "struct.mainparams.numreps = 500000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Distribute parallel jobs\n", "See the general ipa tutorial for more details on parallelization. Here we select 5 K values to distribute tests over, and select 5 replicates for each test, for a total of 25 analyses. If you have 25 cores available then all jobs will start, else the extra jobs will be queued and start when the others finish. The progress bar tracks finished jobs, so it can jump from 0 to 50 or 100% in big strides." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[####################] 100% 20:31:17 | running 25 structure jobs \n" ] } ], "source": [ "struct.ipcluster['cores'] = 25\n", "struct.run(nreps=5, kpop=[2, 3, 4, 5, 6], auto=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analyze results: Choosing K\n", "\n", "The `get_evanno_table` function will summarize finished results from your structure tool. (You can reload results later by re-initing a structure tool with the same name and workdir; see bottom of this notebook.) You can then use the plotting call below to visualize the deltaK value that is recommended for choosing the best K by the Evanno method. I'm of the camp that believes you shouldn't read into these model fitting exercises too much. Show results for multiple K values as seems fitting to describe diversity in your system. \n", "\n", "There is an optional argument you can use called `max_var_multiple`. This will filter out any replicate runs that have a estLnProbStdev that is N times greater than the replicate run with the smallest value. In other words, if some replicates have very large variance it may represent their failure to converge, and so they will be excluded. Try values like 2 or 3 for this option to explore its effect on your results." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NrepslnPKlnPPKdeltaKestLnProbMeanestLnProbStdev
250.000e+000.000e+000.000-2.984e+051.710e+03
353.355e+042.758e+0446.695-2.648e+055.907e+02
455.966e+031.102e+064484.424-2.589e+052.458e+02
55-1.096e+060.000e+000.000-1.355e+062.450e+06
\n", "
" ], "text/plain": [ " Nreps lnPK lnPPK deltaK estLnProbMean estLnProbStdev\n", "2 5 0.000e+00 0.000e+00 0.000 -2.984e+05 1.710e+03\n", "3 5 3.355e+04 2.758e+04 46.695 -2.648e+05 5.907e+02\n", "4 5 5.966e+03 1.102e+06 4484.424 -2.589e+05 2.458e+02\n", "5 5 -1.096e+06 0.000e+00 0.000 -1.355e+06 2.450e+06" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "etable = struct.get_evanno_table([2, 3, 4, 5])\n", "etable" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[toyplot figure: estLnProbMean (red) and deltaK (blue) plotted against K (N ancestral populations)]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# get canvas object and set size\n", "canvas = toyplot.Canvas(width=400, height=300)\n", "\n", "# plot the mean log probability of the models in red\n", "axes = canvas.cartesian(ylabel=\"estLnProbMean\")\n", "axes.plot(etable.estLnProbMean * -1, color=\"darkred\", marker=\"o\")\n", "axes.y.spine.style = {\"stroke\": \"darkred\"}\n", "\n", "# plot delta K with its own scale bar of left side and in blue\n", "axes = axes.share(\"x\", ylabel=\"deltaK\", ymax=etable.deltaK.max() + etable.deltaK.max() * .25)\n", "axes.plot(etable.deltaK, color=\"steelblue\", marker=\"o\");\n", "axes.y.spine.style = {\"stroke\": \"steelblue\"}\n", "\n", "# set x labels\n", "axes.x.ticks.locator = toyplot.locator.Explicit(range(len(etable.index)), etable.index)\n", "axes.x.label.text = \"K (N ancestral populations)\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analyze results: Barplots" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[K4] 5/5 results permuted across replicates (max_var=0).\n" ] } ], "source": [ "# choose a K value and run clumpp to permute results\n", "k = 4\n", "table = struct.get_clumpp_table(k)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sort results table to the order of your IMAP dict" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# sort list by columns\n", "table.sort_values(by=list(range(k)), inplace=True)\n", "\n", "# or, sort by a list of names (here taken from imap)\n", "import itertools\n", "onames = list(itertools.chain(*IMAP.values()))\n", "table = table.loc[onames]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plot barplot" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[toyplot barplot: K=4 admixture proportions for each sample, ordered by IMAP group]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# build barplot\n", "canvas = toyplot.Canvas(width=500, height=250)\n", "axes = canvas.cartesian(bounds=(\"10%\", \"90%\", \"10%\", \"45%\"))\n", "axes.bars(table)\n", "\n", "# add labels to x-axis\n", "ticklabels = [i for i in table.index.tolist()]\n", "axes.x.ticks.locator = toyplot.locator.Explicit(labels=ticklabels)\n", "axes.x.ticks.labels.angle = -60\n", "axes.x.ticks.show = True\n", "axes.x.ticks.labels.offset = 10\n", "axes.x.ticks.labels.style = {\"font-size\": \"12px\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Options: Load existing results\n", "\n", "Results files can be loaded by providing the `name` and `workdir` combination that leads to the path where your previous results were stored. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "12 previous results loaded for run [test]\n" ] } ], "source": [ "rerun = ipa.structure(\n", " data=data, \n", " name=\"test\", \n", " workdir=\"analysis-structure\",\n", " imap=imap,\n", " load_only=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[K3] 3/3 results permuted across replicates (max_var=0).\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
BJSB30.00001.00000.0000
BJSL250.00001.00000.0000
BJVL190.00001.00000.0000
BZBB10.00000.00001.0000
CRL00300.00000.00001.0000
CUCA40.34500.00000.6550
CUSV60.40980.00000.5902
CUVN100.34080.00000.6592
FLAB1091.00000.00000.0000
FLBA1401.00000.00000.0000
FLCK181.00000.00000.0000
FLCK2161.00000.00000.0000
FLMO620.99870.00100.0003
FLSA1851.00000.00000.0000
FLSF331.00000.00000.0000
FLSF471.00000.00000.0000
FLSF541.00000.00000.0000
FLWO61.00000.00000.0000
HNDA090.00000.00001.0000
LALC21.00000.00000.0000
MXED80.17600.69530.1287
MXGT40.15310.80930.0377
MXSA30170.04770.00130.9510
SCCU31.00000.00000.0000
TXGR30.36490.62670.0083
TXMD30.39870.60100.0003
TXWV21.00000.00000.0000
\n", "
" ], "text/plain": [ " 0 1 2\n", "BJSB3 0.0000 1.0000 0.0000\n", "BJSL25 0.0000 1.0000 0.0000\n", "BJVL19 0.0000 1.0000 0.0000\n", "BZBB1 0.0000 0.0000 1.0000\n", "CRL0030 0.0000 0.0000 1.0000\n", "CUCA4 0.3450 0.0000 0.6550\n", "CUSV6 0.4098 0.0000 0.5902\n", "CUVN10 0.3408 0.0000 0.6592\n", "FLAB109 1.0000 0.0000 0.0000\n", "FLBA140 1.0000 0.0000 0.0000\n", "FLCK18 1.0000 0.0000 0.0000\n", "FLCK216 1.0000 0.0000 0.0000\n", "FLMO62 0.9987 0.0010 0.0003\n", "FLSA185 1.0000 0.0000 0.0000\n", "FLSF33 1.0000 0.0000 0.0000\n", "FLSF47 1.0000 0.0000 0.0000\n", "FLSF54 1.0000 0.0000 0.0000\n", "FLWO6 1.0000 0.0000 0.0000\n", "HNDA09 0.0000 0.0000 1.0000\n", "LALC2 1.0000 0.0000 0.0000\n", "MXED8 0.1760 0.6953 0.1287\n", "MXGT4 0.1531 0.8093 0.0377\n", "MXSA3017 0.0477 0.0013 0.9510\n", "SCCU3 1.0000 0.0000 0.0000\n", "TXGR3 0.3649 0.6267 0.0083\n", "TXMD3 0.3987 0.6010 0.0003\n", "TXWV2 1.0000 0.0000 0.0000" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rerun.get_clumpp_table(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Options: Add replicates or additional K values\n", "\n", "You can continue an analysis with the same `name` and `workdir` by setting additional replicates or values of K and calling `.run()` again. Here I will increase the number of replicates per K value from 3 to 5, and run one additional K value. Be sure to use all of the same parameter and filtering values that you used in the previous run or you might cause unexpected problems. \n", "\n", "Here because we already finished 3 replicates for K=2,3,4,5 it will run 2 more for each of those, and it will run 5 replicates for K=6 since we do not have any finished replicates of those yet. You can see which result files exist for a named analysis object by accessing the `.result_files` attribute, or by looking in the working directory. 
To overwrite existing files instead of adding more replicates you can use `force=True` in the run command. You could also simply create a new object with a different name. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "12 previous results loaded for run [test]\n", "Samples: 27\n", "Sites before filtering: 349914\n", "Filtered (indels): 0\n", "Filtered (bi-allel): 13001\n", "Filtered (mincov): 222081\n", "Filtered (minmap): 112898\n", "Filtered (combined): 226418\n", "Sites after filtering: 123496\n", "Sites containing missing values: 96001 (77.74%)\n", "Missing values in SNP matrix: 142017 (4.26%)\n", "Parallel connection | oud: 4 cores\n", "[####################] 100% 3:39:43 | running 13 structure jobs \n" ] } ], "source": [ "# init analysis object with the same params as previously\n", "struct = ipa.structure(\n", " name=\"test\",\n", " data=SNPS,\n", " imap=IMAP,\n", " minmap=MINMAP,\n", " mincov=0.9,\n", ")\n", "\n", "# use the same params as before \n", "struct.mainparams.burnin = 5000\n", "struct.mainparams.numreps = 10000\n", "\n", "# call run for all K values for which you want 5 finished replicates\n", "struct.run(nreps=5, kpop=[2, 3, 4, 5, 6], auto=True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NrepslnPKlnPPKdeltaKestLnProbMeanestLnProbStdev
250.000.000.000000-254950.121675.280397
3525807.6238146.7639.701878-229142.50960.830118
45-12339.147180.722.018760-241481.643556.995749
55-5158.428885.001.413531-246640.066285.676647
653726.580.000.000000-242913.482164.870641
\n", "
" ], "text/plain": [ " Nreps lnPK lnPPK deltaK estLnProbMean estLnProbStdev\n", "2 5 0.00 0.00 0.000000 -254950.12 1675.280397\n", "3 5 25807.62 38146.76 39.701878 -229142.50 960.830118\n", "4 5 -12339.14 7180.72 2.018760 -241481.64 3556.995749\n", "5 5 -5158.42 8885.00 1.413531 -246640.06 6285.676647\n", "6 5 3726.58 0.00 0.000000 -242913.48 2164.870641" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "struct.get_evanno_table([2, 3, 4, 5, 6])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }