{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Live oaks introgression manuscript\n",
    "## Notebook 3: Introgression analyses \n",
    "\n",
    "#### D. Eaton, A. Hipp, A. Gonzalez-Rodriguez & J. Cavender-Bares\n",
    "##### contact: deren.eaton@yale.edu  \n",
    "\n",
    "-----------------------   \n",
    "\n",
    "This notebook shows the creation of input files for performing D-statistic tests, runs the tests in pyRAD, and creates tables of results. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "mkdir -p analysis_dtests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import scipy.stats\n",
    "import itertools\n",
    "import gzip\n",
    "from collections import OrderedDict, Counter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------------------  \n",
    "\n",
    "## D-statistic Introgression analyses\n",
    "\n",
    "Four taxon D-statistic tests were performed on the \".loci\" output of the largest data set (virentes\\_c85d6m4p5.loci). Input files for D-statistic tests designate which samples to iterate over for each test. Below I wrote a Python function to create these input files, and then I perform the tests by designating input files to _pyRAD_ using the -d tag. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## create D-statistic template input file\n",
    "! ~/pyrad_v.2.13/pyRAD -D  >  analysis_dtests/input.template"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "200                          ## N bootstrap replicates\r\n",
      "test.loci                    ## loc/path to input .loci file\r\n",
      "dstats/test1_res             ## output file path/name (no suffix)\r\n",
      "4                            ## which test: 4,part,foil,foilalt\r\n",
      "2                            ## N cores (execute jobs [lines below] in parallel\r\n",
      "0                            ## output ABBA/BABA loci to files (0=no,1,2=verbose)\r\n",
      "0                            ## output bootstrap Ds to files (0=no,1=yes)\r\n",
      "-----------------------------------------------------------\r\n"
     ]
    }
   ],
   "source": [
    "! cat analysis_dtests/input.template"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "## substitute in relevant parameters to template file\n",
    "sed -i \"/## N boot/c\\200                           ## N bootstrap reps \"   analysis_dtests/input.template\n",
    "sed -i \"/## loc/c\\analysis_pyrad/outfiles/virentes_c85d6m4p5.loci   ## loc/path \"   analysis_dtests/input.template\n",
    "sed -i \"/## output file/c\\analysis_dtests/out      ## output file\" analysis_dtests/input.template\n",
    "sed -i \"/## N cores/c\\20                           ## N cores (parallel) \"  analysis_dtests/input.template\n",
    "sed -i \"/## output ABBA/c\\0                        ## output ABBA/BABA \" analysis_dtests/input.template\n",
    "sed -i \"/## output bootstrap/c\\0                   ## output bootstraps \" analysis_dtests/input.template"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## Python function for editing input files\n",
    "## Takes a list of sample names for each taxon\n",
    "## and writes iterations to file, uses template file\n",
    "\n",
    "def makeD4(P1,P2,P3,O,num,name,template):\n",
    "    stringout = \"\"\n",
    "    tests = []\n",
    "    i = 1\n",
    "    for p1 in P1: \n",
    "        for p2 in P2:\n",
    "            for p3 in P3:\n",
    "                if p1 != p2:\n",
    "                    if set([p1,p2,p3]) not in tests:\n",
    "                        stringout += \"[%s]\\t[%s]\\t[%s]\\t[%s]\\t## test %i.%i\\n\" % (p1,p2,p3,\",\".join(O),num,i)\n",
    "                        i += 1\n",
    "                        tests.append(set([p1,p2,p3]))\n",
    "    ! head -n 8 $template > \"analysis_dtests/input.\"$num\".\"$name\".txt\"\n",
    "    ! echo \"$stringout\" >> \"analysis_dtests/input.\"$num\".\"$name\".txt\"\n",
    "    ostring = \"/## output file/c\\\\analysis_dtests/out.\"+str(num)+\".\"+str(name)+\"      ## output file \"\n",
    "    ! sed -i \"$ostring\" \"analysis_dtests/input.\"$num\".\"$name\".txt\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## create sample to taxon lists\n",
    "geminata = [\"FLAB109\",\"FLWO6\",\"FLSF54\",\"FLCK18\"]\n",
    "minima = [\"FLSF47\",\"FLMO62\",\"FLSA185\",\"FLCK216\"]\n",
    "virginiana = [\"LALC2\",\"FLSF33\",\"FLBA140\"]   ## excluded SCCU3 for too little data\n",
    "oleoides = [\"MXSA3017\",\"BZBB1\",\"HNDA09\",\"CRL0001\",\"CRL0030\"]\n",
    "sagraena = [\"CUSV6\",\"CUVN10\",\"CUCA4\"]\n",
    "brandegei = [\"BJVL19\",\"BJSL25\",\"BJSB3\"]\n",
    "fusiformis_t = [\"TXMD3\",\"TXGR3\"]\n",
    "fusiformis_m = [\"MXED8\",\"MXGT4\"]\n",
    "whiteoaks = [\"AR\",\"EN\",\"DO\",\"DU\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create input files for D-statistic tests and perform tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## write the input file for test 1, testing introgression between minima and geminata\n",
    "makeD4(geminata,geminata,minima,whiteoaks,1,\"GGM\",\"analysis_dtests/input.template\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "200                           ## N bootstrap reps \r\n",
      "analysis_pyrad/outfiles/virentes_c85d6m4p5.loci   ## loc/path \r\n",
      "analysis_dtests/out.1.GGM      ## output file \r\n",
      "4                            ## which test: 4,part,foil,foilalt\r\n",
      "20                           ## N cores (parallel) \r\n",
      "0                        ## output ABBA/BABA \r\n",
      "0                   ## output bootstraps \r\n",
      "-----------------------------------------------------------\r\n",
      "[FLAB109]\t[FLWO6]\t[FLSF47]\t[AR,EN,DO,DU]\t## test 1.1\r\n",
      "[FLAB109]\t[FLWO6]\t[FLMO62]\t[AR,EN,DO,DU]\t## test 1.2\r\n",
      "[FLAB109]\t[FLWO6]\t[FLSA185]\t[AR,EN,DO,DU]\t## test 1.3\r\n",
      "[FLAB109]\t[FLWO6]\t[FLCK216]\t[AR,EN,DO,DU]\t## test 1.4\r\n",
      "[FLAB109]\t[FLSF54]\t[FLSF47]\t[AR,EN,DO,DU]\t## test 1.5\r\n",
      "[FLAB109]\t[FLSF54]\t[FLMO62]\t[AR,EN,DO,DU]\t## test 1.6\r\n",
      "[FLAB109]\t[FLSF54]\t[FLSA185]\t[AR,EN,DO,DU]\t## test 1.7\r\n",
      "[FLAB109]\t[FLSF54]\t[FLCK216]\t[AR,EN,DO,DU]\t## test 1.8\r\n",
      "[FLAB109]\t[FLCK18]\t[FLSF47]\t[AR,EN,DO,DU]\t## test 1.9\r\n",
      "[FLAB109]\t[FLCK18]\t[FLMO62]\t[AR,EN,DO,DU]\t## test 1.10\r\n",
      "[FLAB109]\t[FLCK18]\t[FLSA185]\t[AR,EN,DO,DU]\t## test 1.11\r\n",
      "[FLAB109]\t[FLCK18]\t[FLCK216]\t[AR,EN,DO,DU]\t## test 1.12\r\n",
      "[FLWO6]\t[FLSF54]\t[FLSF47]\t[AR,EN,DO,DU]\t## test 1.13\r\n",
      "[FLWO6]\t[FLSF54]\t[FLMO62]\t[AR,EN,DO,DU]\t## test 1.14\r\n",
      "[FLWO6]\t[FLSF54]\t[FLSA185]\t[AR,EN,DO,DU]\t## test 1.15\r\n",
      "[FLWO6]\t[FLSF54]\t[FLCK216]\t[AR,EN,DO,DU]\t## test 1.16\r\n",
      "[FLWO6]\t[FLCK18]\t[FLSF47]\t[AR,EN,DO,DU]\t## test 1.17\r\n",
      "[FLWO6]\t[FLCK18]\t[FLMO62]\t[AR,EN,DO,DU]\t## test 1.18\r\n",
      "[FLWO6]\t[FLCK18]\t[FLSA185]\t[AR,EN,DO,DU]\t## test 1.19\r\n",
      "[FLWO6]\t[FLCK18]\t[FLCK216]\t[AR,EN,DO,DU]\t## test 1.20\r\n",
      "[FLSF54]\t[FLCK18]\t[FLSF47]\t[AR,EN,DO,DU]\t## test 1.21\r\n",
      "[FLSF54]\t[FLCK18]\t[FLMO62]\t[AR,EN,DO,DU]\t## test 1.22\r\n",
      "[FLSF54]\t[FLCK18]\t[FLSA185]\t[AR,EN,DO,DU]\t## test 1.23\r\n",
      "[FLSF54]\t[FLCK18]\t[FLCK216]\t[AR,EN,DO,DU]\t## test 1.24\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "## This is an example of what an input test file looks like\n",
    "## it iterates over all 'iterations' (sample combinations) in a 'test'\n",
    "! cat analysis_dtests/input.1.GGM.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## create remaining input files for D4 tests\n",
    "makeD4(geminata,geminata,minima,whiteoaks,1,\"GGM\",\"analysis_dtests/input.template\")\n",
    "makeD4(minima,minima,geminata,whiteoaks,2,\"MMG\",\"analysis_dtests/input.template\")\n",
    "makeD4(geminata,geminata,virginiana,whiteoaks,3,\"GGV\",\"analysis_dtests/input.template\")\n",
    "makeD4(minima,minima,virginiana,whiteoaks,4,\"MMV\",\"analysis_dtests/input.template\")\n",
    "makeD4(minima,geminata,virginiana,whiteoaks,5,\"MGV\",\"analysis_dtests/input.template\")\n",
    "makeD4(virginiana,virginiana,minima,whiteoaks,6,\"VVM\",\"analysis_dtests/input.template\")\n",
    "makeD4(virginiana,virginiana,geminata,whiteoaks,7,\"VVG\",\"analysis_dtests/input.template\")\n",
    "makeD4(virginiana,geminata,minima,whiteoaks,8,\"VGM\",\"analysis_dtests/input.template\")\n",
    "makeD4(oleoides,sagraena,minima+geminata+virginiana,whiteoaks,9,\"OSmgv\",\"analysis_dtests/input.template\")\n",
    "makeD4(sagraena,minima+geminata+virginiana,oleoides,whiteoaks,10,\"SmgvO\",\"analysis_dtests/input.template\")\n",
    "makeD4(minima+geminata+virginiana,oleoides,sagraena,whiteoaks,11,\"mgvOS\",\"analysis_dtests/input.template\")\n",
    "makeD4(oleoides, oleoides,fusiformis_t+fusiformis_m,whiteoaks,12,\"OOF\",\"analysis_dtests/input.template\")\n",
    "makeD4(oleoides, oleoides,brandegei,whiteoaks,13,\"OOB\",\"analysis_dtests/input.template\")\n",
    "makeD4(brandegei, fusiformis_t+fusiformis_m, oleoides,whiteoaks,14,\"BFO\",\"analysis_dtests/input.template\")\n",
    "makeD4(brandegei, fusiformis_t+fusiformis_m, sagraena,whiteoaks,15,\"BFS\",\"analysis_dtests/input.template\")\n",
    "makeD4(brandegei, fusiformis_t+fusiformis_m, minima+geminata+virginiana,whiteoaks,16,\"BFmgv\",\"analysis_dtests/input.template\")\n",
    "makeD4(brandegei, brandegei, fusiformis_t+fusiformis_m,whiteoaks,17,\"BBF\",\"analysis_dtests/input.template\")\n",
    "makeD4(sagraena, sagraena, minima+geminata+virginiana,whiteoaks,18,\"SSmgv\",\"analysis_dtests/input.template\")\n",
    "makeD4(minima, virginiana,sagraena,whiteoaks,19,\"MVS\",\"analysis_dtests/input.template\")\n",
    "makeD4(minima, geminata,sagraena,whiteoaks,20,\"MGS\",\"analysis_dtests/input.template\")\n",
    "makeD4(virginiana, geminata,sagraena,whiteoaks,21,\"VGS\",\"analysis_dtests/input.template\")\n",
    "makeD4(minima+geminata,virginiana,fusiformis_t,whiteoaks,22,\"mgVFt\",\"analysis_dtests/input.template\")\n",
    "makeD4(oleoides,oleoides,fusiformis_m,whiteoaks,23,\"OOFm\",\"analysis_dtests/input.template\")\n",
    "makeD4(sagraena, sagraena, oleoides ,whiteoaks,24,\"SSO\",\"analysis_dtests/input.template\")\n",
    "makeD4(oleoides, oleoides, sagraena ,whiteoaks,25,\"OOS\",\"analysis_dtests/input.template\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## run tests 1-26\n",
    "for test in range(1,26):\n",
    "    t = glob.glob(\"analysis_dtests/input.\"+str(test)+\".*\")[0]\n",
    "    stderr = ! ~/Dropbox/pyrad-git/pyRAD -d $t"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Partitioned D-statistics\n",
    "The partitioned test statistics can be used to identify false positive D4 statistics that arise due to shared ancestry. I investigate selected tests that yielded significant D4 statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## Python function for creating input files\n",
    "## from a list of samples in each taxon\n",
    "## and writes iterations to outfile, based on template file\n",
    "def makeD5(P1,P2,P3a,P3b,O,num,name,template):\n",
    "    stringout = \"\"\n",
    "    tests = []\n",
    "    i = 1\n",
    "    for p1 in P1: \n",
    "        for p2 in P2:\n",
    "            for p3a in P3a:\n",
    "                for p3b in P3b:\n",
    "                    if p1 != p2:\n",
    "                        if set([p1,p2,p3a,p3b]) not in tests:\n",
    "                            stringout += \"[%s]\\t[%s]\\t[%s]\\t[%s]\\t[%s]\\t## test %i.%i\\n\" % (p1,p2,p3a,p3b,\",\".join(O),num,i)\n",
    "                            i += 1\n",
    "                            tests.append(set([p1,p2,p3a,p3b]))\n",
    "    ! head -n 8 $template > \"analysis_dtests/input.\"$num\".\"$name\".txt\"\n",
    "    ! echo \"$stringout\" >> \"analysis_dtests/input.\"$num\".\"$name\".txt\"\n",
    "    ostring = \"/## output file/c\\\\analysis_dtests/out.\"+str(num)+\".\"+str(name)+\"      ## output file \"\n",
    "    ! sed -i \"$ostring\" \"analysis_dtests/input.\"$num\".\"$name\".txt\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## re-define sample to taxon lists to select only \n",
    "## the most variable individuals from D4 tests to reduce redundancy\n",
    "geminata = [\"FLCK18\"]\n",
    "minima = [\"FLSA185\"]\n",
    "virg_l = [\"LALC2\"]\n",
    "virg_f = [\"FLBA140\"]  \n",
    "oleo_m = [\"MXSA3017\"]\n",
    "oleo_h = [\"HNDA09\"]\n",
    "sagraena = [\"CUVN10\"]\n",
    "brandegei = [\"BJVL19\"]\n",
    "fusiformis_t = [\"TXMD3\"]\n",
    "fusiformis_m = [\"MXED8\"]\n",
    "whiteoaks = [\"AR\",\"EN\",\"DO\",\"DU\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I set the test type to 'part' for partitioned and create input files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "! sed  -i \"/## which/c\\part                       ## which test \" analysis_dtests/input.template\n",
    "\n",
    "## gene flow between clades (Fig.2)\n",
    "makeD5(brandegei, fusiformis_t, oleo_m+oleo_h, virg_f+virg_l, whiteoaks, 26, \"BFtOV\", \"analysis_dtests/input.template\")\n",
    "makeD5(brandegei, fusiformis_m, oleo_m+oleo_h, virg_f+virg_l, whiteoaks, 27, \"BFmOV\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m+oleo_h, virg_f+virg_l, fusiformis_t, brandegei, whiteoaks, 28, \"OVFtB\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m+oleo_h, virg_f+virg_l, fusiformis_m, brandegei, whiteoaks, 29, \"OVFmB\", \"analysis_dtests/input.template\")\n",
    "\n",
    "makeD5(brandegei, fusiformis_t, sagraena, virg_f+virg_l, whiteoaks, 30, \"BFtSV\", \"analysis_dtests/input.template\")\n",
    "makeD5(brandegei, fusiformis_m, sagraena, virg_f+virg_l, whiteoaks, 31, \"BFmSV\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m+oleo_h, sagraena, fusiformis_t, brandegei, whiteoaks, 32, \"SVFtB\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m+oleo_h, sagraena, fusiformis_m, brandegei, whiteoaks, 33, \"SVFmB\", \"analysis_dtests/input.template\")\n",
    "\n",
    "makeD5(brandegei, fusiformis_t, oleo_m+oleo_h, sagraena, whiteoaks, 34, \"BFtOS\", \"analysis_dtests/input.template\")\n",
    "makeD5(brandegei, fusiformis_m, oleo_m+oleo_h, sagraena, whiteoaks, 35, \"BFmOS\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m+oleo_h, sagraena, fusiformis_t, brandegei, whiteoaks, 36, \"OSFtB\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m+oleo_h, sagraena, fusiformis_m, brandegei, whiteoaks, 37, \"OSFmB\", \"analysis_dtests/input.template\")\n",
    "\n",
    "## testing cuba hypothesis 1, contrasts oleoides\n",
    "makeD5(oleo_m, oleo_h, sagraena, minima, whiteoaks, 38, \"OOSM\", \"analysis_dtests/input.template\")\n",
    "makeD5(minima, sagraena, oleo_h, oleo_m, whiteoaks, 39, \"MSO0\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m, oleo_h, sagraena, geminata, whiteoaks, 40, \"OOSG\", \"analysis_dtests/input.template\")\n",
    "makeD5(geminata, sagraena, oleo_h, oleo_m, whiteoaks, 41, \"GSO0\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m, oleo_h, sagraena, virg_l+virg_f, whiteoaks, 42, \"OOSV\", \"analysis_dtests/input.template\")\n",
    "makeD5(virg_l+virg_f, sagraena, oleo_h, oleo_m, whiteoaks, 43, \"VSO0\", \"analysis_dtests/input.template\")\n",
    "\n",
    "## testing cuba hypothesis 2, contrasts Florida oaks\n",
    "makeD5(minima, virg_l+virg_f, sagraena, oleo_h, whiteoaks, 44, \"MVS0\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_h, sagraena, virg_l+virg_f, minima, whiteoaks, 45, \"OSVM\", \"analysis_dtests/input.template\")\n",
    "makeD5(geminata, virg_l+virg_f, sagraena, oleo_h, whiteoaks, 46, \"GVSO\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_h, sagraena, virg_l+virg_f, geminata, whiteoaks, 47, \"OSVG\", \"analysis_dtests/input.template\")\n",
    "makeD5(virg_l, virg_f, sagraena, oleo_h, whiteoaks, 48, \"VVSO\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_h, sagraena, virg_f, virg_l, whiteoaks, 49, \"OSVV\", \"analysis_dtests/input.template\")\n",
    "makeD5(geminata, minima, sagraena, oleo_h, whiteoaks, 50, \"GMSO\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_h, sagraena, geminata, minima, whiteoaks, 51, \"OSMG\", \"analysis_dtests/input.template\")\n",
    "\n",
    "## testing cuba hypothesis 3, contrasts oleoides, Florida, no Sagraeana\n",
    "makeD5(minima, geminata, oleo_h, oleo_m, whiteoaks, 52, \"MGOO\", \"analysis_dtests/input.template\")\n",
    "makeD5(minima, virg_l+virg_f, oleo_h, oleo_m, whiteoaks, 53, \"MVOO\", \"analysis_dtests/input.template\")\n",
    "makeD5(geminata, virg_l+virg_f, oleo_h, oleo_m, whiteoaks, 54, \"GVOO\", \"analysis_dtests/input.template\")\n",
    "makeD5(virg_l, virg_f, oleo_h, oleo_m, whiteoaks, 55, \"VVOO\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m, oleo_h, geminata, minima, whiteoaks, 56, \"OOGM\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m, oleo_h, virg_l+virg_f, minima, whiteoaks, 57, \"OOVM\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m, oleo_h, virg_l+virg_f, geminata, whiteoaks, 58, \"OOVG\", \"analysis_dtests/input.template\")\n",
    "makeD5(oleo_m, oleo_h, geminata, minima, whiteoaks, 59, \"OOVV\", \"analysis_dtests/input.template\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## run tests 26-50\n",
    "for test in range(26,60):\n",
    "    t = glob.glob(\"analysis_dtests/input.\"+str(test)+\".*\")[0]\n",
    "    stderr = ! ~/Dropbox/pyrad-git/pyRAD -d $t"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis of D-statistic results\n",
    "\n",
    "Python functions used to calculate significance of D-statistic tests with Holm-bonferroni correction, and to test for correlation with geographic distances. Sampling locations are in Supplementary Table Sxx. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## strips junk from Dtest results files to make names match in G dict'\n",
    "def nname(x):\n",
    "    x = x.replace(\"]\",\"\").replace(\"[\",\"\").strip()\n",
    "    return x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## returns number of significant tests at some alpha level of significance\n",
    "## after sequential bonferoni correction.\n",
    "from itertools import groupby\n",
    "import scipy.stats\n",
    "\n",
    "def correct(Zs,alpha):\n",
    "    \"\"\" performs Holm-bonferroni correction on \n",
    "        a list of Z-scores \"\"\"\n",
    "    PS = []\n",
    "    Zs = list(Zs)\n",
    "    pone = map(scipy.stats.distributions.norm.sf, Zs)\n",
    "    pvals = map(lambda x: x*2, pone)  ## two-sided test\n",
    "    pvals_idxs = zip(pvals, xrange(len(pvals)))\n",
    "    pvals_idxs.sort()\n",
    "    lp = len(pvals)\n",
    "    for pval, idxs in groupby(pvals_idxs, lambda x: x[0]):\n",
    "        idxs = list(idxs)\n",
    "        for p, i in idxs:\n",
    "            if p < alpha/lp:\n",
    "                PS.append(i)\n",
    "        lp -= len(idxs)\n",
    "    return [len(PS),len(Zs),min(Zs),max(Zs)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## reads in Z scores from the pyRAD output and return summary\n",
    "def getZs(infile,Ntest,exclude=[]):\n",
    "    \"\"\" returns the Z-scores from a \n",
    "        PyRAD Dtest output file \"\"\"\n",
    "    lines = open(infile,'r').readlines()\n",
    "    header, data = lines[0:1],lines[2:]\n",
    "    dat1 = [i.strip().split(\"\\t\") for i in data]\n",
    "    for name in exclude:\n",
    "        dat1 = [line for line in dat1 if name not in \"\".join(line)]\n",
    "    D = np.array([i for i in dat1])\n",
    "    Za = []\n",
    "    Zb = []\n",
    "    Zab = []\n",
    "    taxorder = []\n",
    "    L = []\n",
    "    \n",
    "    if Ntest == 5:\n",
    "        for line in dat1:\n",
    "            p1,p2,p31,p32,o = line[0:5]\n",
    "            Zab.append(line[8])\n",
    "            Za.append(line[9])\n",
    "            Zb.append(line[10])\n",
    "            L.append(line[17])\n",
    "            taxorder.append([p1,p2,p31,p32,o])\n",
    "        RR = [[map(float,Zab),map(float,Za),map(float,Zb)],min(map(int,L)),max(map(int,L)),taxorder]\n",
    "    \n",
    "    if Ntest == 4:\n",
    "        taxorder = zip([D[0],D[1],D[2]])\n",
    "        Z,L = D[:,6],D[:,9]\n",
    "        RR = [map(float,Z), min(map(int,L)), max(map(int,L)),taxorder]\n",
    "        \n",
    "    return RR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## for printing results of bonferoni test\n",
    "def retHB(infile, Ntest, alpha, exclude=[], header=False, ptax=False):\n",
    "    \"\"\" combines the previous two functions into one, \n",
    "        and returns the output in pretty format \"\"\"\n",
    "    if Ntest == 4:\n",
    "        if header == True:\n",
    "            print \"\\t\".join([\"range_Z\", \"N tests/Sig\", \"range_N_loci\"]) \n",
    "        Zs,minL,maxL,taxorder = getZs(infile, Ntest, exclude)\n",
    "        s,t,minz,maxz = correct(Zs, alpha)\n",
    "        return [minz,maxz,s,t,minL,maxL]\n",
    "        \n",
    "    elif Ntest == 5:\n",
    "        if header == True:\n",
    "            print \"\\t\".join([\"range_Z12\", \"Sig\",\n",
    "                             \"range_Z1\", \"Sig\", \n",
    "                             \"range_Z2\", \"Sig\", \n",
    "                             \"range_N_loci\"]) \n",
    "        ZS,minL,maxL,taxorder = getZs(infile, Ntest, exclude)\n",
    "        s12,t12,minz12,maxz12 = correct(ZS[0], alpha)\n",
    "        s1,t1,minz1,maxz1 = correct(ZS[1], alpha)\n",
    "        s2,t2,minz2,maxz2 = correct(ZS[2], alpha)\n",
    "        \n",
    "        return [minz12,maxz12,s12,t12,\n",
    "                minz1,maxz1,s1,t1,\n",
    "                minz2,maxz2,s2,t2,\n",
    "                minL,maxL]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summarized results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "out.1.GGM.D4.txt  \t(0.0 - 2.3)\t0/23\t(8603 - 15297)\n",
      "out.2.MMG.D4.txt  \t(1.3 - 6.8)\t12/23\t(8008 - 12652)\n",
      "out.3.GGV.D4.txt  \t(0.2 - 2.4)\t0/17\t(10283 - 20388)\n",
      "out.4.MMV.D4.txt  \t(0.2 - 4.7)\t7/17\t(8015 - 13022)\n",
      "out.5.MGV.D4.txt  \t(0.1 - 7.9)\t28/47\t(8402 - 16087)\n",
      "out.6.VVM.D4.txt  \t(0.0 - 1.6)\t0/11\t(9873 - 16589)\n",
      "out.7.VVG.D4.txt  \t(0.1 - 2.5)\t0/11\t(11854 - 20361)\n",
      "out.8.VGM.D4.txt  \t(0.0 - 3.9)\t1/47\t(8402 - 16087)\n",
      "out.9.OSmgv.D4.txt  \t(3.1 - 16.2)\t164/164\t(6216 - 17821)\n",
      "out.10.SmgvO.D4.txt  \t(14.7 - 36.4)\t164/164\t(6216 - 17821)\n",
      "out.11.mgvOS.D4.txt  \t(6.8 - 25.8)\t164/164\t(6216 - 17821)\n",
      "out.12.OOF.D4.txt  \t(0.0 - 1.6)\t0/39\t(10541 - 16074)\n",
      "out.13.OOB.D4.txt  \t(0.1 - 2.4)\t0/29\t(11698 - 17326)\n",
      "out.14.BFO.D4.txt  \t(0.0 - 8.1)\t29/59\t(10665 - 18638)\n",
      "out.15.BFS.D4.txt  \t(0.9 - 8.1)\t30/35\t(7506 - 17427)\n",
      "out.16.BFmgv.D4.txt  \t(1.3 - 17.9)\t119/131\t(8819 - 19929)\n",
      "out.17.BBF.D4.txt  \t(0.2 - 2.6)\t0/11\t(12587 - 20569)\n",
      "out.18.SSmgv.D4.txt  \t(0.0 - 4.1)\t2/32\t(6295 - 14053)\n",
      "out.19.MVS.D4.txt  \t(1.1 - 7.1)\t17/35\t(6191 - 15894)\n",
      "out.20.GVS.D4.txt  \t(0.0 - 6.9)\t18/47\t(6235 - 14695)\n",
      "out.21.VGS.D4.txt  \t(0.0 - 2.9)\t0/35\t(7166 - 18320)\n",
      "out.22.mgVFt.D4.txt  \t(3.6 - 10.6)\t47/47\t(8240 - 16372)\n",
      "out.23.OOVFm.D4.txt  \t(0.0 - 1.7)\t0/19\t(12470 - 16074)\n",
      "out.24.SSO.D4.txt  \t(0.0 - 4.1)\t2/14\t(7195 - 13110)\n",
      "out.25.OOS.D4.txt  \t(0.1 - 7.3)\t10/29\t(7856 - 15871)\n",
      "out.99.OSV.D4.txt  \t(5.1 - 12.7)\t44/44\t(7146 - 17821)\n"
     ]
    }
   ],
   "source": [
    "import glob\n",
    "infiles = glob.glob(\"analysis_dtests/*.D4.txt\")\n",
    "infiles.sort(key=lambda x: int(x.split('.')[1]))\n",
    "D = {}\n",
    "for ff in infiles:\n",
    "    rr = retHB(ff,4,0.01,[],0)\n",
    "    ss = \"\\t\".join(map(str,[ff.split(\"/\")[-1]]))\n",
    "    D[ff.split(\"/\")[-1][4:-7]] = tuple(rr)\n",
    "    print \"%s  \\t(%.1f - %.1f)\\t%s/%s\\t(%s - %s)\" % (tuple([ff.split(\"/\")[-1]])+tuple(rr))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introgression from the outgroups?\n",
    "\n",
    "This test uses the more distant red oaks as an outgroup to test introgression into Q. minima relative to Q. virginiana or Q. geminata from a more distant white oak. This result was suggested by the Treemix results. However, we find little evidence of introgression from more distant white oaks, and because they are so divergent I suspect that if there was gene flow it would be very apparent. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    P1         P2         P3    O            D  std(D)     Z    BABA    ABBA  \\\n",
      "0   [FLSA185]  [LALC2]    [DO]  [NI,HE]  0.101   0.051  1.99  209.25  256.25   \n",
      "1   [FLSA185]  [FLBA140]  [DO]  [NI,HE]  0.040   0.053  0.75  224.00  242.50   \n",
      "2   [FLSA185]  [FLSF54]   [DO]  [NI,HE]  0.050   0.052  0.96  208.75  230.75   \n",
      "3   [FLSA185]  [FLAB109]  [DO]  [NI,HE]  0.014   0.057  0.25  221.50  228.00   \n",
      "4   [FLSF47]   [LALC2]    [DO]  [NI,HE] -0.006   0.044  0.13  272.88  269.88   \n",
      "5   [FLSF47]   [FLBA140]  [DO]  [NI,HE]  0.032   0.041  0.79  270.00  288.00   \n",
      "6   [FLSF47]   [FLSF54]   [DO]  [NI,HE] -0.019   0.046  0.42  268.88  258.62   \n",
      "7   [FLSF47]   [FLAB109]  [DO]  [NI,HE]  0.024   0.054  0.45  257.88  270.62   \n",
      "8   [FLSA185]  [LALC2]    [AR]  [NI,HE]  0.045   0.054  0.83  283.25  309.75   \n",
      "9   [FLSA185]  [FLBA140]  [AR]  [NI,HE] -0.044   0.054  0.81  307.50  281.75   \n",
      "10  [FLSA185]  [FLSF54]   [AR]  [NI,HE] -0.022   0.046  0.48  285.50  273.25   \n",
      "11  [FLSA185]  [FLAB109]  [AR]  [NI,HE] -0.097   0.053  1.82  311.62  256.62   \n",
      "12  [FLSF47]   [LALC2]    [AR]  [NI,HE] -0.003   0.048  0.07  337.38  335.12   \n",
      "13  [FLSF47]   [FLBA140]  [AR]  [NI,HE]  0.061   0.041  1.50  327.62  370.38   \n",
      "14  [FLSF47]   [FLSF54]   [AR]  [NI,HE] -0.037   0.043  0.86  320.12  297.12   \n",
      "15  [FLSF47]   [FLAB109]  [AR]  [NI,HE]  0.019   0.043  0.45  304.75  316.75   \n",
      "\n",
      "    nloci  nboot  pdisc  notes  \n",
      "0   10493    200   0.05    NaN  \n",
      "1    9979    200   0.05    NaN  \n",
      "2    9618    200   0.06    NaN  \n",
      "3   10117    200   0.05    NaN  \n",
      "4   13372    200   0.05    NaN  \n",
      "5   12578    200   0.05    NaN  \n",
      "6   12141    200   0.05    NaN  \n",
      "7   12541    200   0.05    NaN  \n",
      "8   11237    200   0.06    NaN  \n",
      "9   10856    200   0.06    NaN  \n",
      "10  10532    200   0.06    NaN  \n",
      "11  10957    200   0.06    NaN  \n",
      "12  14318    200   0.05    NaN  \n",
      "13  13619    200   0.06    NaN  \n",
      "14  13183    200   0.06    NaN  \n",
      "15  13460    200   0.05    NaN  \n",
      "\n",
      "[16 rows x 13 columns]\n"
     ]
    }
   ],
   "source": [
    "print pd.read_table(\"analysis_dtests/supplemental/extras2.D4.txt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test Dfoil statistics for analyses in Fig.2\n",
    "\n",
    "These tests were performed after the other analyses, using pyRAD v3.0.1.\n",
    "This analysis was experimental (Dfoil method not yet published, my dfoil code not yet fully tested) and thus not included in the manuscript. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### change template to dfoil method\n",
    "This option excludes information from terminal SNP patterns (e.g., BAAAA or AAABA), since these can lead to biases when tip samples have different population sizes, especially when taking measurements including heterozygous sites. This is implemented as the Dfoil\\_alt method, as described in Pease \\& Hahn (2014) doi:10.1101/004689"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "! sed -i s/part\\ /foilalt/g analysis_dtests/input.template"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "200                           ## N bootstrap reps \r\n",
      "/home/deren/Documents/Oaks/Virentes/analysis_pyrad/outfiles/virentes_c85d6m4p5.loci   ## loc/path \r\n",
      "analysis_dtests/out      ## output file\r\n",
      "foilalt                           ## which test: 4,part,foil,foilalt\r\n",
      "20                           ## N cores (parallel) \r\n",
      "0                        ## output ABBA/BABA \r\n",
      "0                   ## output bootstraps \r\n",
      "-----------------------------------------------------------\r\n"
     ]
    }
   ],
   "source": [
    "! cat analysis_dtests/input.template"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "makeD5(brandegei, fusiformis_t, oleo_m+oleo_h, virg_f+virg_l, whiteoaks, 100, \"BFtOV\", \"analysis_dtests/input.template\")\n",
    "makeD5(brandegei, fusiformis_m, oleo_m+oleo_h, virg_f+virg_l, whiteoaks, 101, \"BFmOV\", \"analysis_dtests/input.template\")\n",
    "makeD5(brandegei, fusiformis_t, sagraena, virg_f+virg_l, whiteoaks, 102, \"BFtSV\", \"analysis_dtests/input.template\")\n",
    "makeD5(brandegei, fusiformis_m, sagraena, virg_f+virg_l, whiteoaks, 103, \"BFmSV\", \"analysis_dtests/input.template\")\n",
    "makeD5(brandegei, fusiformis_t, oleo_m+oleo_h, sagraena, whiteoaks, 104, \"BFtOS\", \"analysis_dtests/input.template\")\n",
    "makeD5(brandegei, fusiformis_m, oleo_m+oleo_h, sagraena, whiteoaks, 105, \"BFmOS\", \"analysis_dtests/input.template\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import glob\n",
    "for test in range(100,106):\n",
    "    t = glob.glob(\"analysis_dtests/input.\"+str(test)+\".*\")[0]\n",
    "    stderr = ! ~/Dropbox/pyrad-github/pyRAD -d $t"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}