{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Example notebook: Structure with pop assignments\n", "\n", "This notebook shows how to use the `ipyrad.analysis` toolkit to generate structure input files that use population information. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Required software" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# conda install ipyrad -c ipyrad\n", "# conda install structure clumpp -c ipyrad\n", "# conda install toytree -c eaton-lab" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import ipyrad.analysis as ipa\n", "import toyplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a structure analysis object\n", "If you include a 'mapfile' then we will use locus information to subsample just a single SNP from each locus so that the resulting data file will meet the expectations of structure that SNPs are \"unlinked\". If you create multiple replicates files using different random seeds then different SNPs will be selected in each rep. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s = ipa.structure(\n", " name=\"test\", \n", " workdir=\"analysis-structure\",\n", " data=\"analysis-ipyrad/ped_min10_outfiles/ped_min10.str\",\n", " mapfile=\"analysis-ipyrad/ped_min10_outfiles/ped_min10.snps.map\",\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Set params for the structure analysis\n", "These values are used to generate the \"mainparams\" and \"extraparams\" files for structure. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "burnin 1000 \n", "extracols 0 \n", "label 1 \n", "locdata 0 \n", "mapdistances 0 \n", "markernames 0 \n", "markovphase 0 \n", "missing -9 \n", "notambiguous -999 \n", "numreps 5000 \n", "onerowperind 0 \n", "phased 0 \n", "phaseinfo 0 \n", "phenotype 0 \n", "ploidy 2 \n", "popdata 1 \n", "popflag 1 \n", "recessivealleles 0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## set run parameters (you probably want to run >10X this long)\n", "s.mainparams.burnin = 1000\n", "s.mainparams.numreps = 5000\n", "\n", "## tell structure to expect popdata & popflag\n", "s.mainparams.popdata = 1\n", "s.mainparams.popflag = 1\n", "\n", "## print all mainparams\n", "s.mainparams" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "admburnin 500 \n", "alpha 1.0 \n", "alphamax 10.0 \n", "alphapriora 1.0 \n", "alphapriorb 2.0 \n", "alphapropsd 0.025 \n", "ancestdist 0 \n", "ancestpint 0.9 \n", "computeprob 1 \n", "echodata 0 \n", "fpriormean 0.01 \n", "fpriorsd 0.05 \n", "freqscorr 1 \n", "gensback 2 \n", "inferalpha 1 \n", "inferlambda 0 \n", "intermedsave 0 \n", "lambda_ 1.0 \n", "linkage 0 \n", "locispop 0 \n", "locprior 0 \n", "locpriorinit 1.0 \n", "log10rmax 1.0 \n", "log10rmin -4.0 \n", "log10rpropsd 0.1 \n", "log10rstart -2.0 \n", "maxlocprior 20.0 \n", "metrofreq 10 \n", "migrprior 0.01 \n", "noadmix 0 \n", "numboxes 1000 \n", "onefst 0 \n", "pfrompopflagonly 0 \n", "popalphas 0 \n", "popspecificlambda 0 \n", "printlambda 1 \n", "printlikes 0 \n", "printnet 1 \n", "printqhat 0 \n", "printqsum 1 \n", "randomize 0 \n", "reporthitrate 0 \n", "seed 12345 \n", "sitebysite 0 \n", "startatpopinfo 0 \n", "unifprioralpha 1 \n", "updatefreq 10000 \n", "usepopinfo 1 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## tell structure to use popinfo\n", "s.extraparams.usepopinfo = 1\n", "\n", "## print all other extraparams\n", "s.extraparams" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### By default the 'header' of the str file is empty" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelspopdatapopflaglocdataphenotype
029154_superba
130556_thamno
230686_cyathophylla
332082_przewalskii
433413_thamno
533588_przewalskii
635236_rex
735855_rex
838362_rex
939618_rex
1040578_rex
1141478_cyathophylloides
1241954_cyathophylloides
\n", "
" ], "text/plain": [ " labels popdata popflag locdata phenotype\n", "0 29154_superba \n", "1 30556_thamno \n", "2 30686_cyathophylla \n", "3 32082_przewalskii \n", "4 33413_thamno \n", "5 33588_przewalskii \n", "6 35236_rex \n", "7 35855_rex \n", "8 38362_rex \n", "9 39618_rex \n", "10 40578_rex \n", "11 41478_cyathophylloides \n", "12 41954_cyathophylloides " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.header" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### You can fill it in by filling the header attribute lists\n", "`popdata` is the *a priori* population assignment of an individual to a population. Assignments should be non-zero integers (e.g., 1, 2, 3). Zero is reserved to mean that there is no *a priori* assignment. `popflag` indicates whether or not to use the population assignment in the analysis (1) or to leave it to be inferred (0). So in the example below seven samples have assigned populations (popflag=1), and six samples will have their population assignments inferred (popflag=0). The popdata information will only be used for the seven individuals with assigned pops. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelspopdatapopflaglocdataphenotype
029154_superba11
130556_thamno31
230686_cyathophylla10
332082_przewalskii20
433413_thamno30
533588_przewalskii20
635236_rex30
735855_rex30
838362_rex31
939618_rex31
1040578_rex31
1141478_cyathophylloides11
1241954_cyathophylloides11
\n", "
" ], "text/plain": [ " labels popdata popflag locdata phenotype\n", "0 29154_superba 1 1 \n", "1 30556_thamno 3 1 \n", "2 30686_cyathophylla 1 0 \n", "3 32082_przewalskii 2 0 \n", "4 33413_thamno 3 0 \n", "5 33588_przewalskii 2 0 \n", "6 35236_rex 3 0 \n", "7 35855_rex 3 0 \n", "8 38362_rex 3 1 \n", "9 39618_rex 3 1 \n", "10 40578_rex 3 1 \n", "11 41478_cyathophylloides 1 1 \n", "12 41954_cyathophylloides 1 1 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## assign popdata and popflag by entering a list of values\n", "s.popdata = [1, 3, 1, 2, 3, 2, 3, 3, 3, 3, 3, 1, 1]\n", "s.popflag = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]\n", "\n", "## print the header information\n", "s.header" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write the input files and run STRUCTURE\n", "This will write the `.str` file (and subsample SNPs if you included a mapfile) with the header information included, and it will write a mainparams and extraparams file with the parameter settings that we entered above. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('/home/deren/Documents/ipyrad/tests/analysis-structure/tmp-test-3-1.mainparams.txt',\n", " '/home/deren/Documents/ipyrad/tests/analysis-structure/tmp-test-3-1.extraparams.txt',\n", " '/home/deren/Documents/ipyrad/tests/analysis-structure/tmp-test-3-1.strfile.txt')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## writes three input files and then call structure yourself\n", "s.write_structure_files(kpop=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Or, submit jobs directly to the cluster\n", "If you start an ipcluster instance (see other tutorials) you can submit structure jobs directly to the cluster and easily collect the results, like below. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "connected to 40 cores\n" ] } ], "source": [ "## connect to a running ipcluster instance\n", "import ipyparallel as ipp\n", "ipyclient = ipp.Client()\n", "print \"connected to {} cores\".format(len(ipyclient))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "submitted 3 structure jobs [test-K-3]\n" ] } ], "source": [ "## submit job replicates to ipyclient\n", "s.run(kpop=3, nreps=3, ipyclient=ipyclient)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## wait for jobs to finish\n", "ipyclient.wait()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Collect results and plot" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "mean scores across 3 replicates.\n" ] } ], "source": [ "## get table of summarized results\n", "table = s.get_clumpp_table(3)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## reorder table by membership in groups\n", "table.sort_values(by=[0, 1, 2], inplace=True)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
30556_thamno33413_thamno35236_rex35855_rex38362_rex39618_rex40578_rex32082_przewalskii33588_przewalskii29154_superba30686_cyathophylla41478_cyathophylloides41954_cyathophylloides0.00.51.0
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "## build barplot\n", "canvas = toyplot.Canvas(width=500, height=250)\n", "axes = canvas.cartesian(bounds=(\"10%\", \"90%\", \"10%\", \"45%\"))\n", "axes.bars(table)\n", "\n", "## add labels to x-axis\n", "ticklabels = [i for i in table.index.tolist()]\n", "axes.x.ticks.locator = toyplot.locator.Explicit(labels=ticklabels)\n", "axes.x.ticks.labels.angle = -60\n", "axes.x.ticks.show = True\n", "axes.x.ticks.labels.offset = 10\n", "axes.x.ticks.labels.style = {\"font-size\": \"12px\"}" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }