{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Live oak introgression manuscript\n",
    "## Notebook 1: RADseq Data Assembly  \n",
    "\n",
    "\n",
    "#### D. Eaton, A. Hipp, A. Gonzalez-Rodriguez & J. Cavender-Bares\n",
    "##### contact: deren.eaton@yale.edu  \n",
    "\n",
    "-----------------------   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### This notebook  \n",
    "This is an IPython notebook, a tool for reproducible science. It can be downloaded and executed to run all scripts in the notebook, or viewed as 'read-only' through a browser. The default language in each cell is IPython except where indicated with \"%%bash\" or \"!\", indicating bash scripts. \n",
    "\n",
    "-------------------------  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## create directories for data assembly\n",
    "mkdir -p analysis_pyrad/fastq/  \n",
    "mkdir -p figures/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the sequence data\n",
    "\n",
    "Sequence data for this study is archived on the NCBI sequence read archive (SRA). The data were run in two separate Illumina runs, but are combined under a single project id number. \n",
    "\n",
    "Project SRA: SRP055977  \n",
    "Project number: PRJNA277574  \n",
    "Biosample numbers: SAMN03394519 -- SAMN03394561  \n",
    "Runs: SRR1915524 -- SRR1915566"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first data set contains 30 samples sequenced on an Illumina GAIIx for 100bp single-end reads, and all sample names from this data set end with \"\\_v\". The second data set includes 15 samples sequenced on an Illumina HiSEQ2000 for 100bp single-end reads, and samples from this data set end with \"\\_re\". The quality scores are offset by 64 on the first data and by 33 on the second. Of the 15 samples in the second data set, 7 are re-sequenced individuals from the first, and thus after being filtered through step 2 of the _pyRAD_ analysis are combined with the replicates from the first data set, and clustered together in step 3. This yields a total of 37 samples in the data set. The sample \"VI1\" from the Hipp data set does not have voucher information and was thus excluded from the analysis. The re-sequenced sample \"CRL0001\" from the Hipp data set was suspected to be contaminated and was also excluded. The total number of samples is 37. \n",
    "\n",
    "You can download the data set using the script below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## get the SRA format files\n",
    "for run in $(seq 24 28);\n",
    "\n",
    "  do\n",
    "wget -q -r -nH --cut-dirs=9 \\\n",
    "ftp://ftp-trace.ncbi.nlm.nih.gov/\\\n",
    "sra/sra-instant/reads/ByRun/sra/SRR/\\\n",
    "SRR191/SRR19155$run/SRR19155$run\".sra\";\n",
    "\n",
    "done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Read 4046890 spots for SRR1915524.sra\n",
      "Written 4046890 spots for SRR1915524.sra\n",
      "Read 5352627 spots for SRR1915525.sra\n",
      "Written 5352627 spots for SRR1915525.sra\n",
      "Read 4715624 spots for SRR1915526.sra\n",
      "Written 4715624 spots for SRR1915526.sra\n",
      "Read 4539385 spots for SRR1915527.sra\n",
      "Written 4539385 spots for SRR1915527.sra\n",
      "Read 3742953 spots for SRR1915528.sra\n",
      "Written 3742953 spots for SRR1915528.sra\n",
      "Read 22397479 spots total\n",
      "Written 22397479 spots total\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "## convert sra files to fastq using fastq-dump tool\n",
    "fastq-dump *.sra"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rename fastq files to have sample ID names instead of SRR run IDs. The following script will replace the numbers with the correct names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## This reads in a table mapping sample names to SRA numbers\n",
    "## that is hosted on github\n",
    "import pandas\n",
    "import urllib2\n",
    "import glob\n",
    "import os\n",
    "\n",
    "## open table from github url\n",
    "url = \"https://raw.githubusercontent.com/\"+\\\n",
    "      \"dereneaton/virentes/master/SraRunTable.txt\"\n",
    "intable = urllib2.urlopen(url)\n",
    "\n",
    "## make name xfer dictionary\n",
    "DF = pandas.read_table(intable, sep=\"\\t\")\n",
    "D = {DF.Run_s[i]:DF.Library_Name_s[i] for i in DF.index}\n",
    "\n",
    "## change file names and move to fastq dir/\n",
    "for fname in glob.glob(\"*.fastq\"):\n",
    "    os.rename(fname, \"analysis_pyrad/fastq/\"+\\\n",
    "                     D[fname.replace(\".fastq\",\".fq\")])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# _pyRAD_ assembly of _de novo_ RADseq loci\n",
    "\n",
    "RADseq data were filtered and clustered using the program _pyRAD_ v.2.13. Below are the scripts for generating the params file and entering the parameters used in the study. Because the data are already de-multiplexed (sorted by barcodes), a barcodes files is not needed, and we begin the analysis from step 2 of _pyRAD_. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Creating a params input file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## create template params file\n",
    "~/pyrad_v.2.13/pyRAD -n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## substitute new parameters into file\n",
    "sed -i '/## 1. /c\\analysis_pyrad         ## 1. working directory ' params.txt\n",
    "sed -i '/## 2. /c\\                       ## 2. data loc          ' params.txt\n",
    "sed -i '/## 3. /c\\                       ## 3. barcode loc       ' params.txt\n",
    "sed -i '/## 7. /c\\12                     ## 7. N processors      ' params.txt\n",
    "sed -i '/## 9. /c\\6                      ## 9. NQual             ' params.txt\n",
    "sed -i '/## 10./c\\.85                    ## 10. clust threshold  ' params.txt\n",
    "sed -i '/## 13./c\\5                      ## 13. maxSH            ' params.txt\n",
    "sed -i '/## 14./c\\virentes_c85d6m4p5     ## 14. output name      ' params.txt\n",
    "sed -i '/## 21./c\\1                      ## 21. Filter type      ' params.txt\n",
    "sed -i '/## 26./c\\20                     ## 26. MaxSNPs          ' params.txt\n",
    "sed -i '/## 29./c\\2,2                    ## 29. trim overhang    ' params.txt\n",
    "sed -i '/## 30./c\\u,n,k                  ## 30. output formats   ' params.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## my modified template params file\n",
    "! cat params.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filtering the data\n",
    "\n",
    "I set the location of the de-multiplexed data to select only the '\\_re' data set first, then I run step 2 of _pyRAD_ to filter these data based on quality scores with the default offset score of 33, and also to remove reads with Illumina adapters detected. After this I change the data location to select only the '\\_v' data set files, and run step 2 with the offset score changed to 64.   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "sed -i '/## 18./c\\analysis_pyrad/fastq/*_re.fastq  ## 18. select re data files ' params.txt  \n",
    "sed -i '/## 20./c\\33              ## 20. using default 33       ' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 2\n",
    "\n",
    "sed -i '/## 18./c\\analysis_pyrad/fastq/*_v.fastq   ## 18. select v data files ' params.txt\n",
    "sed -i '/## 20./c\\64              ## 20. change offset to 64       ' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## remove the two samples that we are excluding\n",
    "rm analysis_pyrad/edits/VI1_re.edit\n",
    "rm analysis_pyrad/edits/CRL0001_re.edit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## IPython code to concatenate data of re-sequenced individuals\n",
    "\n",
    "import glob\n",
    "\n",
    "edits = glob.glob(\"analysis_pyrad/edits/*.edit\")\n",
    "names = [\"_\".join(i.split(\"_\")[:-1]) for i in edits]\n",
    "\n",
    "for name in set(names): \n",
    "    if names.count(name) == 2:\n",
    "        cmd = \"cat %s_re.edit %s_v.edit > %s.edit\" % (name,name,name)\n",
    "        stderr = ! $cmd\n",
    "        cmd = \"rm %s_re.edit %s_v.edit\" % (name,name)\n",
    "        stderr = ! $cmd\n",
    "    else:\n",
    "        cmd = \"mv %s_re.edit %s.edit\" % (name,name)\n",
    "        stderr = ! $cmd\n",
    "        cmd = \"mv %s_v.edit %s.edit\" % (name,name)\n",
    "        stderr = ! $cmd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## remove _re and _v endings from sequence names\n",
    "## this makes it cleaner for combining re-sequenced samples\n",
    "for editfile in glob.glob(\"analysis_pyrad/edits/*.edit\"):\n",
    "    cmd = \"sed -i s/%s/_/g %s\" % (\"_re_\",editfile)\n",
    "    ! $cmd\n",
    "    cmd = \"sed -i s/%s/_/g %s\" % (\"_v_\",editfile)\n",
    "    ! $cmd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clustering within samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calling consensus sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 45"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exclude small samples\n",
    "Because the two samples TXWV2 and CUMM5 have considerably less data than all other samples we chose to exclude them at this step before clustering across samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "rm analysis_pyrad/clust.85/TXWV2.consens.gz\n",
    "rm analysis_pyrad/clust.85/CUMM5.consens.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clustering across samples \n",
    "This is the most time-consuming step. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Outputting assembled data sets\n",
    "\n",
    "By changing the parameters on lines 13,14, and by subselecting taxa using line 19 of the params file, I created several data sets that include/exclude different samples, and vary in their proportions of missing data.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## The largest, all inclusive, and most missing data, data set (All_min4)\n",
    "sed -i '/## 12./c\\4                        ## 12. MinCov           ' params.txt  \n",
    "sed -i '/## 14./c\\virentes_c85d6m4p5       ## 14. output name      ' params.txt\n",
    "sed -i '/## 17./c\\                         ## 17. exclude samples  ' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## Smaller, all inclusive, less missing data, data set (All_min20)\n",
    "sed -i '/## 12./c\\20                        ## 12. MinCov           ' params.txt  \n",
    "sed -i '/## 14./c\\virentes_c85d6m20p5       ## 14. output name      ' params.txt\n",
    "sed -i '/## 17./c\\                          ## 17. exclude samples  ' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## excluding outgroups for using in structure\n",
    "sed -i '/## 12./c\\20                             ## 12. MinCov           ' params.txt  \n",
    "sed -i '/## 14./c\\virentes_c85d6m20p5noutg       ## 14. output name      ' params.txt\n",
    "sed -i '/## 17./c\\AR,EN,DO,DU,CH,NI,HE,EN        ## 17. exclude samples  ' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## Only MGV clade and outgroups, low missing and high missing data sets (MGV_min4, MGV_min16)\n",
    "sed -i '/## 12./c\\4                         ## 12. MinCov           ' params.txt  \n",
    "sed -i '/## 14./c\\MGV_c85d6m4p5             ## 14. output name      ' params.txt\n",
    "sed -i '/## 17./c\\BJSB3,BJSL25,BJVL19,BZBB1,CRL0001,CRL0030,CUCA4,CUSV6,CUVN10,HNDA09,MXED8,MXGT4,MXSA3017,TXGR3,TXMD3    ## 17. exclude samples' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 7\n",
    "\n",
    "sed -i '/## 12./c\\16                        ## 12. MinCov           ' params.txt\n",
    "sed -i '/## 14./c\\MGV_c85d6m16p5            ## 14. output name      ' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## Only OS clade and outgroups, low missing and high missing data sets (OS_min4, OS_min14)\n",
    "sed -i '/## 12./c\\4                        ## 12. MinCov           ' params.txt  \n",
    "sed -i '/## 14./c\\OS_c85d6m4p5             ## 14. output name      ' params.txt\n",
    "sed -i '/## 17./c\\BJSB3,BJSL25,BJVL19,MXED8,MXGT4,TXGR3,TXMD3,FLAB109,FLBA140,FLCK18,FLCK216,FLMO62,FLSA185,FLSF33,FLSF47,FLSF54,FLWO6,LALC2,SCCU3     ## 17. exclude samples' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 7\n",
    "\n",
    "sed -i '/## 12./c\\13                        ## 12. MinCov           ' params.txt  \n",
    "sed -i '/## 14./c\\OS_c85d6m13p5             ## 14. output name      ' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## Only OSMGV clade and outgroups, low missing and high missing data sets (OSMGV_min4, OSMGV_min16)\n",
    "sed -i '/## 12./c\\4                           ## 12. MinCov           ' params.txt  \n",
    "sed -i '/## 14./c\\OSMGV_c85d6m4p5             ## 14. output name      ' params.txt\n",
    "sed -i '/## 17./c\\BJSB3,BJSL25,BJVL19,MXED8,MXGT4,TXGR3,TXMD3     ## 17. exclude samples' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 7\n",
    "\n",
    "sed -i '/## 12./c\\20                           ## 12. MinCov           ' params.txt  \n",
    "sed -i '/## 14./c\\OSMGV_c85d6m20p5             ## 14. output name      ' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## Only FB clade and outgroups, low missing and high missing data sets (FB_min4, FB_min16)\n",
    "sed -i '/## 12./c\\4                           ## 12. MinCov           ' params.txt  \n",
    "sed -i '/## 14./c\\FB_c85d6m4p5                ## 14. output name      ' params.txt\n",
    "sed -i '/## 17./c\\FLAB109,FLBA140,FLCK18,FLCK216,FLMO62,FLSA185,FLSF33,FLSF47,FLSF54,FLWO6,LALC2,SCCU3,BZBB1,CRL0001,CRL0030,CUCA4,CUSV6,CUVN10,HNDA09,MXSA3017    ## 17. exclude samples' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 7\n",
    "\n",
    "sed -i '/## 12./c\\12                           ## 12. MinCov           ' params.txt  \n",
    "sed -i '/## 14./c\\FB_c85d6m12p5                ## 14. output name      ' params.txt\n",
    "~/pyrad_v.2.13/pyRAD -p params.txt -s 7"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}