{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simulate a really big data set for testing\n",
    "\n",
    "This is will be used to test memory limits in ipyrad, and also for comparing run times with other software. RAD data sets have three size dimensions that greatly affect run times: the number of taxa, the number of loci, and the length of reads. Here we will set all three to quite large values, by simulating paired-end 150bp reads for 300 taxa for 100K loci. For simplicity we do not simulate any missing data, but this is not expected to affect run times or memory limits except to the extent that it increased the total number of loci in the data set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## import Python libraries\n",
    "import ipyrad as ip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Process is terminated.\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "\n",
    "## this will take about XX minutes to run, sorry, the code is not parallelized\n",
    "## we simulate 360 tips by using the default 12 taxon tree and requesting 40\n",
    "## individuals per taxon. Default is theta=0.002. Crown age= 5*2Nu (check this)\n",
    "simrrls -L 100000 -f pairddrad -l 150 -n 30 -o Big_i360_L100K\n",
    "\n",
    "## because it takes a long time to simulate this data set, you can alternatively\n",
    "## just download the data set that we already simulated using the code above. \n",
    "## The data set is hosted on anaconda, just run the following to get it. \n",
    "# conda download -c ipyrad bigData"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data set with *ipyrad*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  New Assembly: bigHorsePaired\n",
      "  0   assembly_name               bigHorsePaired                               \n",
      "  1   project_dir                 ./bigdata                                    \n",
      "  2   raw_fastq_path              ./bigHorsePaired_R*_.fastq.gz                \n",
      "  3   barcodes_path               ./bigHorsePaired_barcodes.txt                \n",
      "  4   sorted_fastq_path                                                        \n",
      "  5   assembly_method             denovo                                       \n",
      "  6   reference_sequence                                                       \n",
      "  7   datatype                    pairddrad                                    \n",
      "  8   restriction_overhang        ('TGCAG', 'CCGG')                            \n",
      "  9   max_low_qual_bases          5                                            \n",
      "  10  phred_Qscore_offset         33                                           \n",
      "  11  mindepth_statistical        6                                            \n",
      "  12  mindepth_majrule            6                                            \n",
      "  13  maxdepth                    10000                                        \n",
      "  14  clust_threshold             0.85                                         \n",
      "  15  max_barcode_mismatch        0                                            \n",
      "  16  filter_adapters             0                                            \n",
      "  17  filter_min_trim_len         35                                           \n",
      "  18  max_alleles_consens         2                                            \n",
      "  19  max_Ns_consens              (5, 5)                                       \n",
      "  20  max_Hs_consens              (8, 8)                                       \n",
      "  21  min_samples_locus           4                                            \n",
      "  22  max_SNPs_locus              (20, 20)                                     \n",
      "  23  max_Indels_locus            (8, 8)                                       \n",
      "  24  max_shared_Hs_locus         0.5                                          \n",
      "  25  edit_cutsites               (0, 0)                                       \n",
      "  26  trim_overhang               (0, 0, 0, 0)                                 \n",
      "  27  output_formats              ['l', 'p', 's', 'v']                         \n",
      "  28  pop_assign_file                                                          \n"
     ]
    }
   ],
   "source": [
    "data = ip.Assembly(\"bigHorsePaired\")\n",
    "\n",
    "data.set_params(\"project_dir\", \"bigdata\")\n",
    "data.set_params(\"raw_fastq_path\", \"bigHorsePaired_R*_.fastq.gz\")\n",
    "data.set_params(\"barcodes_path\", \"bigHorsePaired_barcodes.txt\")\n",
    "data.set_params(\"datatype\", \"pairddrad\")\n",
    "data.set_params(\"restriction_overhang\", (\"TGCAG\", \"CCGG\"))\n",
    "data.get_params()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Demultiplex the data files\n",
    "All of the software methods we are comparing will use the same demultiplexed data files, so we will not count this step towards the total run times comparisons. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data.run('1')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The data files\n",
    "These are the demultiplexed data files that each method will analyze. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ls -l bigdata/bigHorsePaired_fastqs/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assemble the data set with *ipyrad*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data.run(\"234567\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assemble the data set with *stacks*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assemble the data set with *ddocent*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assemble the data set with AftrRAD\n",
    "We do not attempt this, since this software was unable to assemble much smaller data sets. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compare run times"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}