{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <span style=\"color:gray\">ipyrad-analysis toolkit:</span> window_extracter\n",
    "\n",
    "The `window_extracter` tool is used to extract sequence data from a selected genomic window, apply filters, concatenate, and write to a phylip or nexus file. This is useful for combining multiple reference mapped RAD loci into a single large contig, for example around a gene of interest, or for applying filtering options to a denovo or refmapped dataset to create new supermatrices for downstream phylogenetic analysis. It is particularly convenient for taking just the single largest RAD-seq assembly of a dataset (e.g., min4 coverage) and applying different filtering options to create alignment matrices for downstream phylogenetic analyses that each include different subsets of taxa and sites.\n",
    "\n",
    "Key features:\n",
    "\n",
    "1. Concatenate ref-mapped RAD loci in genomic windows.\n",
    "2. Apply filters to remove sites or taxa by missing data.\n",
    "3. Optionally apply consensus calling to reduce missingness by reducing multiple samples to a single representative."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://eaton-lab.org/slides/data-svg/window-extracter-min4.svg\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Required software"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# conda install ipyrad -c conda-forge -c bioconda"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'0.9.61'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import ipyrad.analysis as ipa\n",
    "ipa.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Required input data files\n",
    "Your input data should be a `.seqs.hdf` database file produced by ipyrad. This file contains the full sequence alignment for your samples as well as associated meta-data of the genomic positions of RAD loci relative to a reference genome. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# path to an HDF5 formatted seqs file\n",
    "SEQSFILE = \"/tmp/oaks.seqs.hdf5\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "file already exists\n"
     ]
    }
   ],
   "source": [
    "# download example seqs file if not already present (~500Mb, takes ~5 minutes)\n",
    "URL = \"https://www.dropbox.com/s/c1u89nwuuv8e6ie/virentes_ref.seqs.hdf5?raw=1\"\n",
    "ipa.download(URL, path=SEQSFILE);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The scaffold table\n",
    "\n",
    "The `window_extracter()` tool takes the `.seqs.hdf5` database file from ipyrad as its input file. You select scaffolds by their index (integer) which can be found in the `.scaffold_table`. We can see from the table below that this genome has 12 large scaffolds (chromosome-scale linkage blocks) and many other smaller unplaced scaffolds. If you are working with a high quality reference genome then it will likely look similar to this, whereas many other reference genomes will be composed of many more scaffolds that are mostly smaller in size. Here I will focus just on the large chromosomes. Take note that the scaffolds are ordered the same as in the reference genome file, which may not involve the largest scaffolds coming first, so you may need to sort the table by length."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scaffold_name</th>\n",
       "      <th>scaffold_length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>55068941</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Qrob_Chr02</td>\n",
       "      <td>115639695</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Qrob_Chr03</td>\n",
       "      <td>57474983</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Qrob_Chr04</td>\n",
       "      <td>44977106</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Qrob_Chr05</td>\n",
       "      <td>70629082</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Qrob_Chr06</td>\n",
       "      <td>57352617</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Qrob_Chr07</td>\n",
       "      <td>51661711</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Qrob_Chr08</td>\n",
       "      <td>71345938</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Qrob_Chr09</td>\n",
       "      <td>50221317</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Qrob_Chr10</td>\n",
       "      <td>50368918</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Qrob_Chr11</td>\n",
       "      <td>52130961</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Qrob_Chr12</td>\n",
       "      <td>39860516</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Qrob_H2.3_Sc0000024</td>\n",
       "      <td>2943817</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Qrob_H2.3_Sc0000026</td>\n",
       "      <td>2906018</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Qrob_H2.3_Sc0000030</td>\n",
       "      <td>2801502</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          scaffold_name  scaffold_length\n",
       "0            Qrob_Chr01         55068941\n",
       "1            Qrob_Chr02        115639695\n",
       "2            Qrob_Chr03         57474983\n",
       "3            Qrob_Chr04         44977106\n",
       "4            Qrob_Chr05         70629082\n",
       "5            Qrob_Chr06         57352617\n",
       "6            Qrob_Chr07         51661711\n",
       "7            Qrob_Chr08         71345938\n",
       "8            Qrob_Chr09         50221317\n",
       "9            Qrob_Chr10         50368918\n",
       "10           Qrob_Chr11         52130961\n",
       "11           Qrob_Chr12         39860516\n",
       "12  Qrob_H2.3_Sc0000024          2943817\n",
       "13  Qrob_H2.3_Sc0000026          2906018\n",
       "14  Qrob_H2.3_Sc0000030          2801502"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# first load the data file with no other arguments to see scaffold table\n",
    "ext = ipa.window_extracter(SEQSFILE)\n",
    "\n",
    "# the scaffold table shows scaffold names and lens in order of the ref. genome\n",
    "ext.scaffold_table.head(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Selecting scaffolds\n",
    "The `scaffold_idxs` designates the scaffold to extract sequence data from. This is the index (row) of the named scaffold from the scaffold table (e.g., above). The `window_extracter` tool will select all RAD data within this window and exclude any sites that have no data (e.g., the space between RAD markers, or the space between paired reads) to create a clean concise alignment. \n",
    "\n",
    "The `.stats` attribute shows the information content of the selected window before and after filtering. The stats are returned as a dataframe, showing the size, information content, missingness, and number of samples in the alignment. For example, the code cell below selects all RAD data that mapped to the first scaffold (index 0). Very few sites were filtered because we have not yet applied any filtering options (the few that were removed represent sites that are all Ns)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scaffold</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>sites</th>\n",
       "      <th>snps</th>\n",
       "      <th>missing</th>\n",
       "      <th>samples</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>prefilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>0</td>\n",
       "      <td>None</td>\n",
       "      <td>890747</td>\n",
       "      <td>34015</td>\n",
       "      <td>0.47</td>\n",
       "      <td>37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>postfilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>0</td>\n",
       "      <td>None</td>\n",
       "      <td>886039</td>\n",
       "      <td>34013</td>\n",
       "      <td>0.47</td>\n",
       "      <td>37</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              scaffold  start   end   sites   snps  missing  samples\n",
       "prefilter   Qrob_Chr01      0  None  890747  34015     0.47       37\n",
       "postfilter  Qrob_Chr01      0  None  886039  34013     0.47       37"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# select a scaffold idx, start, and end positions\n",
    "ext = ipa.window_extracter(\n",
    "    data=SEQSFILE,\n",
    "    scaffold_idxs=0,\n",
    ")\n",
    "\n",
    "# show stats of the window\n",
    "ext.stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Subsetting scaffold windows\n",
    "You can use the `start` and `end` arguments to select subsets of scaffolds as smaller window sizes to be extracted. As with the example above the selected window will be filtered to reduce missing data. If there is no data in the selected window the stats will show no sites, and a warning will be printed. An example with no data and with some data are both shown below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "No data in selected window.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scaffold</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>sites</th>\n",
       "      <th>snps</th>\n",
       "      <th>missing</th>\n",
       "      <th>samples</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>prefilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>0</td>\n",
       "      <td>10000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>postfilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>0</td>\n",
       "      <td>10000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              scaffold  start    end  sites  snps  missing  samples\n",
       "prefilter   Qrob_Chr01      0  10000      0     0        0        0\n",
       "postfilter  Qrob_Chr01      0  10000      0     0        0        0"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# select a scaffold idx, start, and end positions\n",
    "ext = ipa.window_extracter(\n",
    "    data=SEQSFILE,\n",
    "    scaffold_idxs=0,\n",
    "    start=0, \n",
    "    end=10000,\n",
    ")\n",
    "\n",
    "# show stats of the window\n",
    "ext.stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scaffold</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>sites</th>\n",
       "      <th>snps</th>\n",
       "      <th>missing</th>\n",
       "      <th>samples</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>prefilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>500000</td>\n",
       "      <td>800000</td>\n",
       "      <td>5023</td>\n",
       "      <td>235</td>\n",
       "      <td>0.34</td>\n",
       "      <td>37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>postfilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>500000</td>\n",
       "      <td>800000</td>\n",
       "      <td>5019</td>\n",
       "      <td>235</td>\n",
       "      <td>0.34</td>\n",
       "      <td>37</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              scaffold   start     end  sites  snps  missing  samples\n",
       "prefilter   Qrob_Chr01  500000  800000   5023   235     0.34       37\n",
       "postfilter  Qrob_Chr01  500000  800000   5019   235     0.34       37"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# select a scaffold idx, start, and end positions\n",
    "ext = ipa.window_extracter(\n",
    "    data=SEQSFILE,\n",
    "    scaffold_idxs=0,\n",
    "    start=500000, \n",
    "    end=800000,\n",
    ")\n",
    "\n",
    "# show stats of the window\n",
    "ext.stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filtering missing data with `mincov`\n",
    "\n",
    "You can filter sites from the alignment by using `mincov`, which applies a filter to all sites in the alignment. For example, `mincov=0.5` will require that 50% of samples contain a site that is not `N` or `-` for the site to be included in the alignment. This value can be a proportion like 0.5, or it can be a number, like 10. Similarly, the argument `rmincov` filters rows (taxa) from the matrix if they have data for less than some percentage of sites in the matrix.\n",
    "\n",
    "<img src=\"https://eaton-lab.org/slides/data-svg/window-extracter-min8.svg\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scaffold</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>sites</th>\n",
       "      <th>snps</th>\n",
       "      <th>missing</th>\n",
       "      <th>samples</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>prefilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>500000</td>\n",
       "      <td>800000</td>\n",
       "      <td>5023</td>\n",
       "      <td>235</td>\n",
       "      <td>0.34</td>\n",
       "      <td>37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>postfilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>500000</td>\n",
       "      <td>800000</td>\n",
       "      <td>2632</td>\n",
       "      <td>166</td>\n",
       "      <td>0.07</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              scaffold   start     end  sites  snps  missing  samples\n",
       "prefilter   Qrob_Chr01  500000  800000   5023   235     0.34       37\n",
       "postfilter  Qrob_Chr01  500000  800000   2632   166     0.07       36"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# select a scaffold idx, start, and end positions\n",
    "ext = ipa.window_extracter(\n",
    "    data=SEQSFILE,\n",
    "    scaffold_idxs=0,\n",
    "    start=500000, \n",
    "    end=800000,\n",
    "    mincov=0.8,\n",
    "    rmincov=0.5,\n",
    ")\n",
    "\n",
    "# show stats of the window\n",
    "ext.stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filtering missing data with `imap` and `minmap`\n",
    "\n",
    "An `imap` dictionary can be used to group samples into populations/species, as in the example below. It takes key,value pairs where the key is the name of the group, and the value is a list of sample names. One way to use an `imap` is to apply a `minmap` filter. This acts just like the global `mincov` filter, but applies to each group separately. Only if a site meets the minimum coverage argument for each group will it be retained in the data set. In this case the `imap` sampling selected 28 samples to include in the dataset and required 75% of data in each group which reduced the number of SNPs from 105 to 76. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# assign samples to groups/taxa\n",
    "imap = {\n",
    "    \"reference\": [\"reference\"],\n",
    "    \"virg\": [\"TXWV2\", \"LALC2\", \"SCCU3\", \"FLSF33\", \"FLBA140\"],\n",
    "    \"mini\": [\"FLSF47\", \"FLMO62\", \"FLSA185\", \"FLCK216\"],\n",
    "    \"gemi\": [\"FLCK18\", \"FLSF54\", \"FLWO6\", \"FLAB109\"],\n",
    "    \"bran\": [\"BJSL25\", \"BJSB3\", \"BJVL19\"],\n",
    "    \"fusi\": [\"MXED8\", \"MXGT4\", \"TXGR3\", \"TXMD3\"],\n",
    "    \"sagr\": [\"CUVN10\", \"CUCA4\", \"CUSV6\"],\n",
    "    \"oleo\": [\"CRL0030\", \"HNDA09\", \"BZBB1\", \"MXSA3017\"],\n",
    "}\n",
    "\n",
    "# set a simple minmap requiring 1 sample from each group\n",
    "minmap = {name: 0.75 for name in imap}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scaffold</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>sites</th>\n",
       "      <th>snps</th>\n",
       "      <th>missing</th>\n",
       "      <th>samples</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>prefilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>500000</td>\n",
       "      <td>800000</td>\n",
       "      <td>5023</td>\n",
       "      <td>105</td>\n",
       "      <td>0.33</td>\n",
       "      <td>28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>postfilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>500000</td>\n",
       "      <td>800000</td>\n",
       "      <td>2355</td>\n",
       "      <td>76</td>\n",
       "      <td>0.06</td>\n",
       "      <td>28</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              scaffold   start     end  sites  snps  missing  samples\n",
       "prefilter   Qrob_Chr01  500000  800000   5023   105     0.33       28\n",
       "postfilter  Qrob_Chr01  500000  800000   2355    76     0.06       28"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# select a scaffold idx, start, and end positions\n",
    "ext = ipa.window_extracter(\n",
    "    data=SEQSFILE,\n",
    "    scaffold_idxs=0,\n",
    "    start=500000, \n",
    "    end=800000,\n",
    "    mincov=0.8,\n",
    "    imap=imap,\n",
    "    minmap=minmap,\n",
    ")\n",
    "\n",
    "# show stats of the window\n",
    "ext.stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Subsample taxa with `imap`\n",
    "\n",
    "You can use an imap dictionary to select which samples to include/exclude from an analysis. This is an easy way to remove rogue taxa, hybrids, or technical replicates from phylogenetic analyses. Here I select a subset ot taxa to include in the analyses and keep only sites that have 80% coverage from scaffold 2 (Qrob_Chr03). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scaffold</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>sites</th>\n",
       "      <th>snps</th>\n",
       "      <th>missing</th>\n",
       "      <th>samples</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>prefilter</th>\n",
       "      <td>Qrob_Chr03</td>\n",
       "      <td>0</td>\n",
       "      <td>None</td>\n",
       "      <td>875960</td>\n",
       "      <td>9488</td>\n",
       "      <td>0.51</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>postfilter</th>\n",
       "      <td>Qrob_Chr03</td>\n",
       "      <td>0</td>\n",
       "      <td>None</td>\n",
       "      <td>321858</td>\n",
       "      <td>5849</td>\n",
       "      <td>0.10</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              scaffold  start   end   sites  snps  missing  samples\n",
       "prefilter   Qrob_Chr03      0  None  875960  9488     0.51       13\n",
       "postfilter  Qrob_Chr03      0  None  321858  5849     0.10       13"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# select a scaffold idx, start, and end positions\n",
    "ext = ipa.window_extracter(\n",
    "    data=SEQSFILE,\n",
    "    scaffold_idxs=2,\n",
    "    mincov=0.8,\n",
    "    imap={\n",
    "        \"include\": [\n",
    "            \"TXWV2\", \"LALC2\", \"SCCU3\", \"FLSF33\", \"FLBA140\",\n",
    "            \"FLSF47\", \"FLMO62\", \"FLSA185\", \"FLCK216\",\n",
    "            \"FLCK18\", \"FLSF54\", \"FLWO6\", \"FLAB109\",\n",
    "        ]\n",
    "    },\n",
    ")\n",
    "\n",
    "# show stats of the window\n",
    "ext.stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Concatenate multiple scaffolds together\n",
    "You can also concatenate multiple scaffolds together using `window_extracter`. This can be useful for creating genome-wide alignments, or smaller subsets of the genome. For example, you may want to combine multiple scaffolds from the same chromosome together, or, if you are working with denovo data, you could even combine a random sample of anonymous loci together as a sort of pseudo bootstrapping procedure. To select multiple scaffolds you simply provide a list or range of scaffold idxs. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scaffold</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>sites</th>\n",
       "      <th>snps</th>\n",
       "      <th>missing</th>\n",
       "      <th>samples</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>concatenated</td>\n",
       "      <td>0</td>\n",
       "      <td>3248430</td>\n",
       "      <td>3248430</td>\n",
       "      <td>168876</td>\n",
       "      <td>0.192</td>\n",
       "      <td>37</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       scaffold  start      end    sites    snps  missing  samples\n",
       "0  concatenated      0  3248430  3248430  168876    0.192       37"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# select a scaffold idx, start, and end positions\n",
    "ext = ipa.window_extracter(\n",
    "    data=SEQSFILE,\n",
    "    scaffold_idxs=[0, 1, 2, 3, 4, 5],\n",
    "    mincov=0.5,\n",
    ")\n",
    "\n",
    "# show stats of the window\n",
    "ext.stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Consensus reduction with `imap`\n",
    "\n",
    "You can further reduce missing data by condensing data from multiple samples into a single \"consensus\" representative using the `consensus_reduce=True` option. This uses the `imap` dictionary to group samples into groups and sample the most frequent allele. This can be particularly useful for analyses in which you want dense species-level coverage with little missing data, but it is not particularly important which individual represents the sampled allele for a species at a given locus. For example, if you want to construct many gene trees with one representative per species to use as input to a two-step species tree inference program like ASTRAL. (Note: to automate `window_extracter` calls across many windows of the genome see the `treeslider` tool.)\n",
    "\n",
    "In the example below you can see that the consensus option reduces the size of the dataset from 28 samples to 8 samples, and in doing so reduces the missing data from 19% to 1%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://eaton-lab.org/slides/data-svg/consensus.svg\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scaffold</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>sites</th>\n",
       "      <th>snps</th>\n",
       "      <th>missing</th>\n",
       "      <th>samples</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>prefilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>200000</td>\n",
       "      <td>5000000</td>\n",
       "      <td>51288</td>\n",
       "      <td>1622</td>\n",
       "      <td>0.19</td>\n",
       "      <td>28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>postfilter</th>\n",
       "      <td>Qrob_Chr01</td>\n",
       "      <td>200000</td>\n",
       "      <td>5000000</td>\n",
       "      <td>48927</td>\n",
       "      <td>651</td>\n",
       "      <td>0.01</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              scaffold   start      end  sites  snps  missing  samples\n",
       "prefilter   Qrob_Chr01  200000  5000000  51288  1622     0.19       28\n",
       "postfilter  Qrob_Chr01  200000  5000000  48927   651     0.01        8"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# select a scaffold idx, start, and end positions\n",
    "ext = ipa.window_extracter(\n",
    "    data=SEQSFILE,\n",
    "    scaffold_idxs=0,\n",
    "    start=200000, \n",
    "    end=5000000,\n",
    "    mincov=0.8,\n",
    "    imap=imap,\n",
    "    minmap=minmap,\n",
    "    consensus_reduce=True,   # <--- uses IMAP info to make consensus calls\n",
    ")\n",
    "\n",
    "# show stats of the window\n",
    "ext.stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Write selected window to a file\n",
    "Once you've chosen the final set of arguments to select the window of interest you can write the alignment to *.phy* format by calling the `.run()` command. If you want to write to nexus format you can simply add the argument `nexus=True`. To change the name and location where file will be written you can set the `name` and `workdir` options on the window_extracter tool, otherwise it will use default options."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote data to /home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-window_extracter/scaf0-200000-5000000.phy\n"
     ]
    }
   ],
   "source": [
    "ext.run(force=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accessing the output files\n",
    "\n",
    "The output files created by the `.run()` command will be written to the working directory (defaults to \"./analysis-window_extracter\"). You can either find the full path to that file or access it easily from the extracter object itself as an attribute like below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-window_extracter/scaf0-200000-5000000.phy'"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# path to the phylip file output\n",
    "ext.outfile"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}