{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RecA deep mutational scanning libraries\n",
    "This example shows how to use [alignparse](https://jbloomlab.github.io/alignparse/index.html) to process PacBio circular consensus sequencing of a barcoded library of RecA variants for deep mutational scanning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up for analysis\n",
    "Import necessary Python modules.\n",
    "We use [alignparse](https://jbloomlab.github.io/alignparse/index.html) for most of the operations, [plotnine](https://plotnine.readthedocs.io) for ggplot2-like plotting, and a few functitons from [dms_variants](https://jbloomlab.github.io/dms_variants):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:28.075817Z",
     "iopub.status.busy": "2024-05-23T18:11:28.075508Z",
     "iopub.status.idle": "2024-05-23T18:11:31.029873Z",
     "shell.execute_reply": "2024-05-23T18:11:31.028964Z",
     "shell.execute_reply.started": "2024-05-23T18:11:28.075789Z"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import warnings\n",
    "\n",
    "import numpy\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "from plotnine import *\n",
    "\n",
    "import alignparse.ccs\n",
    "import alignparse.consensus\n",
    "import alignparse.minimap2\n",
    "import alignparse.targets\n",
    "from alignparse.constants import CBPALETTE\n",
    "\n",
    "import dms_variants.utils"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Suppress warnings that clutter output:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:31.034049Z",
     "iopub.status.busy": "2024-05-23T18:11:31.033729Z",
     "iopub.status.idle": "2024-05-23T18:11:31.037853Z",
     "shell.execute_reply": "2024-05-23T18:11:31.037005Z",
     "shell.execute_reply.started": "2024-05-23T18:11:31.034021Z"
    }
   },
   "outputs": [],
   "source": [
    "warnings.simplefilter(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Directory for output:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:31.041720Z",
     "iopub.status.busy": "2024-05-23T18:11:31.041385Z",
     "iopub.status.idle": "2024-05-23T18:11:31.045733Z",
     "shell.execute_reply": "2024-05-23T18:11:31.044937Z",
     "shell.execute_reply.started": "2024-05-23T18:11:31.041692Z"
    }
   },
   "outputs": [],
   "source": [
    "outdir = \"./output_files/\"\n",
    "os.makedirs(outdir, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Target amplicon\n",
    "We have performed sequencing of an amplicon that includes the RecA gene along with barcodes and several other features.\n",
    "The amplicon is defined in [Genbank Flat File format](https://www.ncbi.nlm.nih.gov/genbank/samplerecord/).\n",
    "First, let's look at that file.\n",
    "Note how it defines the features; this is how they must be defined to be handled by [alignparse](https://jbloomlab.github.io/alignparse/index.html).\n",
    "Note also how there are ambiguous nucleotides in the barcode and variant tag regions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:31.050536Z",
     "iopub.status.busy": "2024-05-23T18:11:31.050145Z",
     "iopub.status.idle": "2024-05-23T18:11:31.056477Z",
     "shell.execute_reply": "2024-05-23T18:11:31.055658Z",
     "shell.execute_reply.started": "2024-05-23T18:11:31.050502Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LOCUS       RecA_PacBio_amplicon    1342 bp ds-DNA     linear       06-AUG-2018\n",
      "DEFINITION  PacBio amplicon for deep mutational scanning of E. coli RecA.\n",
      "ACCESSION   None\n",
      "VERSION     \n",
      "SOURCE      Danny Lawrence\n",
      "  ORGANISM  .\n",
      "COMMENT     PacBio amplicon for RecA libraries.\n",
      "COMMENT     There are single nucleotide tags in the 5' and 3' termini to measure strand exchange.\n",
      "FEATURES             Location/Qualifiers\n",
      "     termini5        1..147\n",
      "                     /label=\"termini 5' of gene\"\n",
      "     gene            148..1206\n",
      "                     /label=\"RecA gene\"\n",
      "     spacer          1207..1285\n",
      "                     /label=\"spacer between gene & barcode\"\n",
      "     barcode         1286..1303\n",
      "                     /label=\"18 nucleotide barcode\"\n",
      "     termini3        1304..1342\n",
      "                     /label=\"termini 3' of barcode\"\n",
      "     variant_tag5    33..33\n",
      "                     /label=\"5' variant tag\"\n",
      "     variant_tag3    1311..1311\n",
      "                     /label=\"3' variant tag\"\n",
      "ORIGIN\n",
      "        1 gcacggcgtc acactttgct atgccatagc atRtttatcc ataagattag cggatcctac\n",
      "       61 ctgacgcttt ttatcgcaac tctctactgt ttctccataa cagaacatat tgactatccg\n",
      "      121 gtattacccg gcatgacagg agtaaaaATG GCTATCGACG AAAACAAACA GAAAGCGTTG\n",
      "      181 GCGGCAGCAC TGGGCCAGAT TGAGAAACAA TTTGGTAAAG GCTCCATCAT GCGCCTGGGT\n",
      "      241 GAAGACCGTT CCATGGATGT GGAAACCATC TCTACCGGTT CGCTTTCACT GGATATCGCG\n",
      "      301 CTTGGGGCAG GTGGTCTGCC GATGGGCCGT ATCGTCGAAA TCTACGGACC GGAATCTTCC\n",
      "      361 GGTAAAACCA CGCTGACGCT GCAGGTGATC GCCGCAGCGC AGCGTGAAGG TAAAACCTGT\n",
      "      421 GCGTTTATCG ATGCTGAACA CGCGCTGGAC CCAATCTACG CACGTAAACT GGGCGTCGAT\n",
      "      481 ATCGACAACC TGCTGTGCTC CCAGCCGGAC ACCGGCGAGC AGGCACTGGA AATCTGTGAC\n",
      "      541 GCCCTGGCGC GTTCTGGCGC AGTAGACGTT ATCGTCGTTG ACTCCGTGGC GGCACTGACG\n",
      "      601 CCGAAAGCGG AAATCGAAGG CGAAATCGGC GACTCTCATA TGGGCCTTGC GGCACGTATG\n",
      "      661 ATGAGCCAGG CGATGCGTAA GCTGGCGGGT AACCTGAAGC AGTCCAACAC GCTGCTGATC\n",
      "      721 TTCATCAACC AGATCCGTAT GAAAATTGGT GTGATGTTCG GCAACCCGGA AACCACTACC\n",
      "      781 GGTGGTAACG CGCTGAAATT CTACGCCTCT GTTCGTCTCG ACATCCGTCG TATCGGCGCG\n",
      "      841 GTGAAAGAGG GCGAAAACGT GGTGGGTAGC GAAACCCGCG TGAAAGTGGT GAAGAACAAA\n",
      "      901 ATCGCTGCGC CGTTTAAACA GGCTGAATTC CAGATCCTCT ACGGCGAAGG TATCAACTTC\n",
      "      961 TACGGCGAAC TGGTTGACCT GGGCGTAAAA GAGAAGCTGA TCGAGAAAGC AGGCGCGTGG\n",
      "     1021 TACAGCTACA AAGGTGAGAA GATCGGTCAG GGTAAAGCGA ATGCGACTGC CTGGCTGAAA\n",
      "     1081 GATAACCCGG AAACCGCGAA AGAGATCGAG AAGAAAGTAC GTGAGTTGCT GCTGAGCAAC\n",
      "     1141 CCGAACTCAA CGCCGGATTT CTCTGTAGAT GATAGCGAAG GCGTAGCAGA AACTAACGAA\n",
      "     1201 GATTTTTAAt cgtcttgttt gatacacaag ggtcgcatct gcggcccttt tgctttttta\n",
      "     1261 agttgtaagg atatgccatt ctagannnnn nnnnnnnnnn nnnagatcgg Yagagcgtcg\n",
      "     1321 tgtagggaaa gagtgtggta cc   \n",
      "//\n",
      "\n"
     ]
    }
   ],
   "source": [
    "recA_targetfile = \"input_files/recA_amplicon.gb\"\n",
    "\n",
    "with open(recA_targetfile) as f:\n",
    "    print(f.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Along with the Genbank file giving the sequence of the amplicon, we have a YAML file specifying how to filter and parse alignments to this amplicon.\n",
    "Below is the text of the YAML file.\n",
    "\n",
    "As you can see below, the YAML file specifies how well alignments must match the target in order to be retained.\n",
    "The query clipping indicates the max amount of the query that can be clipped at each end prior to the alignment.\n",
    "For each feature, there is a number indicating the max allowable number of nucleotides of that feature can be clipped in the alignment, as well as the max allowable number of mutated nucleotides (indels count in proportion to the number of nucleotide mutations) and mutation \"operations\" (indels count as one operation regardless of size).\n",
    "Below the mutation operation filter is all set to `null`, meaning that for this analysis all the filtering is done on the number of mutated nucleotides.\n",
    "When filters are missing for a feature, they are automatically set to zero.\n",
    "\n",
    "The YAML file also specifies what information is parsed from alignments that are not filtered.\n",
    "As you can see, for some features we parse the mutations or the full sequence of the feature, along with the accuracy of that feature in the sequencing query (computed from the Q-values):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:31.061598Z",
     "iopub.status.busy": "2024-05-23T18:11:31.061206Z",
     "iopub.status.idle": "2024-05-23T18:11:31.066683Z",
     "shell.execute_reply": "2024-05-23T18:11:31.065960Z",
     "shell.execute_reply.started": "2024-05-23T18:11:31.061567Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RecA_PacBio_amplicon:\n",
      "  query_clip5: 4\n",
      "  query_clip3: 4\n",
      "  termini5:\n",
      "    filter:\n",
      "      clip5: 4\n",
      "      mutation_nt_count: 1\n",
      "      mutation_op_count: null\n",
      "  gene:\n",
      "    filter:\n",
      "      mutation_nt_count: 30\n",
      "      mutation_op_count: null\n",
      "    return: [mutations, accuracy]\n",
      "  spacer:\n",
      "    filter:\n",
      "      mutation_nt_count: 1\n",
      "      mutation_op_count: null\n",
      "  barcode:\n",
      "    return: [sequence, accuracy]\n",
      "  termini3:\n",
      "    filter:\n",
      "      clip3: 4\n",
      "      mutation_nt_count: 1\n",
      "      mutation_op_count: null\n",
      "  variant_tag5:\n",
      "    return: sequence\n",
      "  variant_tag3:\n",
      "    return: sequence\n",
      "\n"
     ]
    }
   ],
   "source": [
    "recA_parse_specs_file = \"input_files/recA_feature_parse_specs.yaml\"\n",
    "with open(recA_parse_specs_file) as f:\n",
    "    print(f.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now read the amplicon into an [alignparse.targets.Targets](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets) object with the feature-parsing specs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:31.071237Z",
     "iopub.status.busy": "2024-05-23T18:11:31.070654Z",
     "iopub.status.idle": "2024-05-23T18:11:31.103527Z",
     "shell.execute_reply": "2024-05-23T18:11:31.102551Z",
     "shell.execute_reply.started": "2024-05-23T18:11:31.071203Z"
    }
   },
   "outputs": [],
   "source": [
    "targets = alignparse.targets.Targets(\n",
    "    seqsfile=recA_targetfile, feature_parse_specs=recA_parse_specs_file\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(Note that although we just have one target in this example, there can be multiple targets specified in `seqsfile` and `feature_parse_specs` when initializing a [Targets](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets). See the [Lassa virus glycoprotein](https://jbloomlab.github.io/alignparse/lasv_pilot.html) or [Single-cell virus sequencing](https://jbloomlab.github.io/alignparse/flu_virus_seq_example.html) example notebooks for examples of this.)\n",
    "\n",
    "We can plot the [Targets](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets) object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:31.105478Z",
     "iopub.status.busy": "2024-05-23T18:11:31.105175Z",
     "iopub.status.idle": "2024-05-23T18:11:31.344586Z",
     "shell.execute_reply": "2024-05-23T18:11:31.343791Z",
     "shell.execute_reply.started": "2024-05-23T18:11:31.105447Z"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAA1YAAAEpCAYAAACQt7NWAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy80BEi2AAAACXBIWXMAAA9hAAAPYQGoP6dpAABZNUlEQVR4nO3dd1QUVxsG8GdZlqX3XgRpimLvHRVj793YW6yJiZqmxpZojMYYu1/svaBib1hi7y12UUFAlCoC0nZ3vj8IG1e6Cyzl+Z2z57Az99555zoh83Lv3BEJgiCAiIiIiIiIPpmWpgMgIiIiIiIq6ZhYERERERERqYmJFRERERERkZqYWBEREREREamJiRUREREREZGamFgRERERERGpiYkVERERERGRmphYERERERERqYmJFRERERERkZqYWBERUbGzfv16iEQiBAUFaTqUIpHV+fr4+MDHx0djMRERUf4wsSIiyqeMm+CMj7a2NhwcHDB48GCEhYUV+vEfPnwIkUgEXV1dvH379pPa+PgcdHV14enpiXHjxuHNmzcFG/C/fHx8VI6po6OD8uXLY+TIkQgJCSmUYxIRERUVbU0HQERUUs2aNQvly5dHcnIyLl++jPXr1+P8+fO4d+8edHV1C+24mzdvhq2tLWJjY+Hn54fhw4d/clsfnsP58+exYsUKHD58GPfu3YO+vn4BRp3O0dERc+fOBQCkpqbiwYMHWLlyJY4dO4aHDx8qjzlgwAD06dMHUqm0wGMoKY4fP67pEIiIKB+YWBERfaK2bduidu3aAIDhw4fD0tIS8+bNw/79+9GrV69COaYgCNi6dSv69euHFy9eYMuWLWolVh+fg4WFBRYuXIh9+/ahb9++BRW2komJCfr376+yrXz58hg3bhwuXLiAVq1aAQDEYjHEYnGBH78k0dHR0XQIRESUD5wKSERUQJo0aQIAePbsmXLbo0eP0KNHD5ibm0NXVxe1a9fG/v37M9V9+/Ytvv76a7i4uEAqlcLR0REDBw5EVFSUSrkLFy4gKCgIffr0QZ8+fXD27FmEhoYW2Dm0aNECAPDixQsAwIIFC9CwYUNYWFhAT08PtWrVgp+fX5Z1N2/ejLp160JfXx9mZmZo2rRpnkZdbG1tAQDa2v/9rS+7Z6yWL1+OypUrQyqVwt7eHmPHjs33dMiYmBhMmjQJVapUgaGhIYyNjdG2bVvcuXNHpdyZM2cgEomwc+dOzJw5Ew4ODjAyMkKPHj0QFxeHlJQUTJgwAdbW1jA0NMSQIUOQkpKi0oZIJMK4ceOwZcsWVKhQAbq6uqhVqxbOnj2ba5xZPWOVnJyMGTNmwNPTE7q6urCzs0O3bt1UrrnExERMnDgRTk5OkEqlqFChAhYsWABBELKMzd/fH97e3pBKpahcuTKOHj2ar/4kIqJ0HLEiIiogGUmAmZkZAOD+/fto1KgRHBwc8P3338PAwAA7d+5Ely5dsHv3bnTt2hUAkJCQgCZNmuDhw4cYOnQoatasiaioKOzfvx+hoaGwtLRUHmPLli1wc3NDnTp14O3tDX19fWzbtg2TJ08ukHPIuEG3sLAAAPz555/o1KkTPv/8c6SmpmL79u3o2bMnDh48iPbt2yvrzZw5EzNmzEDDhg0xa9Ys6Ojo4MqVKzh16hQ+++wzZTm5XK5MFtPS0vDw4UNMnz4d7u7uaNSoUY6xzZgxAzNnzoSvry9Gjx6Nx48fY8WKFbh27RouXLgAiUSSp3N8/vw5/P390bNnT5QvXx5v3rzBqlWr0KxZMzx48AD29vYq5efOnQs9PT18//33CAwMxJIlSyCRSKClpYXY2FjMmDFDORW0fPny+Omnn1Tq//3339ixYwe+/PJLSKVSLF++HG3atMHVq1fh7e2dp5gz+q5Dhw44efIk+vTpg6+++grx8fE4ceIE7t27Bzc3NwiCgE6dOuH06dMYNmwYqlevjmPHjmHy5MkICwvDH3/8odLm
+fPnsWfPHowZMwZGRkZYvHgxunfvjpcvXyqvASIiyiOBiIjyZd26dQIAISAgQIiMjBRCQkIEPz8/wcrKSpBKpUJISIggCILQsmVLoUqVKkJycrKyrkKhEBo2bCh4eHgot/30008CAGHPnj2ZjqVQKJQ/p6amChYWFsKUKVOU2/r16ydUq1atQM5h+/btgoWFhaCnpyeEhoYKgiAI79+/V6mXmpoqeHt7Cy1atFBue/r0qaClpSV07dpVkMvl2cbfrFkzAUCmj5eXl/D8+fMs43vx4oUgCIIQEREh6OjoCJ999pnKMZYuXSoAENauXZvnc09OTs4U54sXLwSpVCrMmjVLue306dMCAMHb21tITU1Vbu/bt68gEomEtm3bqrTRoEEDwdnZWWVbxjlev35duS04OFjQ1dUVunbtmu35CkJ6fzVr1kz5fe3atQIAYeHChZnOKaOf/f39BQDCzz//rLK/R48egkgkEgIDA1Vi09HRUdl2584dAYCwZMmSTMcgIqKccSogEdEn8vX1hZWVFZycnNCjRw8YGBhg//79cHR0RExMDE6dOoVevXohPj4eUVFRiIqKQnR0NFq3bo2nT58qVxDcvXs3qlWrphzB+pBIJFL+fOTIEURHR6s8+9S3b1/cuXMH9+/fV/sc+vTpA0NDQ+zduxcODg4AAD09PWXZ2NhYxMXFoUmTJrh586Zyu7+/PxQKBX766Sdoaan+b+XD+AHAxcUFJ06cwIkTJ3DkyBEsWrQIcXFxaNu2LSIjI7ONMyAgAKmpqZgwYYLKMUaMGAFjY2McOnQoz+cslUqVbcjlckRHR8PQ0BAVKlRQOa8MAwcOVBkNq1evHgRBwNChQ1XK1atXDyEhIZDJZCrbGzRogFq1aim/lytXDp07d8axY8cgl8vzHPfu3bthaWmJ8ePHZ9qX0c+HDx+GWCzGl19+qbJ/4sSJEAQBR44cUdnu6+sLNzc35feqVavC2NgYz58/z3NcRESUjlMBiYg+0bJly+Dp6Ym4uDisXbsWZ8+eVa5iFxgYCEEQMG3aNEybNi3L+hEREXBwcMCzZ8/QvXv3XI+3efNmlC9fHlKpFIGBgQAANzc36OvrY8uWLZgzZ84nn4O2tjZsbGxQoUIFlcTl4MGD+Pnnn3H79m2V54c+TJiePXsGLS0tVKpUKdfjGRgYwNfXV/m9TZs2aNy4MWrXro1ff/0Vv//+e5b1goODAQAVKlRQ2a6jowNXV1fl/rxQKBT4888/sXz5crx48UIluclq+lu5cuVUvpuYmAAAnJycMm1XKBSIi4tTacfDwyNTm56ennj//j0iIyOVz5jl5tmzZ6hQoYLKs2gfCw4Ohr29PYyMjFS2e3l5Kfd/6ONzA9KnssbGxuYpJiIi+g8TKyKiT1S3bl3linpdunRB48aN0a9fPzx+/BgKhQIAMGnSJLRu3TrL+u7u7nk+1rt373DgwAEkJydneaO+detW/PLLL5lGiPJzDh87d+4cOnXqhKZNm2L58uWws7ODRCLBunXrsHXr1nwdJye1atWCiYlJnhZ0KAhz5szBtGnTMHToUMyePRvm5ubQ0tLChAkTlP9uH8pudcLstgsfLRJRnJWGcyAiKi6YWBERFQCxWIy5c+eiefPmWLp0qXKamEQiURmhyYqbmxvu3buXY5k9e/YgOTkZK1asUFnMAgAeP36MqVOn4sKFC2jcuLF6J/KB3bt3Q1dXF8eOHVN5n9S6desyxa9QKPDgwQNUr179k44ll8uRkJCQ7X5nZ2cA6efq6uqq3J6amooXL17k2scf8vPzQ/PmzbFmzRqV7W/fvs3UtwXh6dOnmbY9efIE+vr6sLKyynM7bm5uuHLlCtLS0rJdqMPZ2RkBAQGIj49XGbV69OiRcj8RERUOPmNFRFRAfHx8ULduXSxatAjGxsbw8fHBqlWrEB4enqnsh88Tde/eHXfu3MHevXszlcsYOdi8eTNcXV0xatQo9OjRQ+UzadIkGBoaYsuW
LQV6PmKxGCKRSGWqXFBQEPz9/VXKdenSBVpaWpg1a1amEZ+8jHycPn0aCQkJqFatWrZlfH19oaOjg8WLF6u0uWbNGsTFxamsUJgbsVicKa5du3Ypn3kraJcuXVJ5diskJAT79u3DZ599lq93dXXv3h1RUVFYunRppn0Z59OuXTvI5fJMZf744w+IRCK0bdv2E8+CiIhywxErIqICNHnyZPTs2RPr16/HsmXL0LhxY1SpUgUjRoyAq6sr3rx5g0uXLiE0NFT53qTJkyfDz88PPXv2xNChQ1GrVi3ExMRg//79WLlyJaysrHD69OlMCxJkkEqlaN26NXbt2oXFixfnednx3LRv3x4LFy5EmzZt0K9fP0RERGDZsmVwd3fH3bt3leXc3d0xZcoUzJ49G02aNEG3bt0glUpx7do12NvbY+7cucqycXFx2Lx5MwBAJpMpl0zPWM48O1ZWVvjhhx8wc+ZMtGnTBp06dcLjx4+xfPly1KlTJ9NLh3PSoUMHzJo1C0OGDEHDhg3xzz//YMuWLSojYQXJ29sbrVu3VlluHUhfoj4/Bg4ciI0bN+Kbb77B1atX0aRJEyQmJiIgIABjxoxB586d0bFjRzRv3hxTpkxBUFAQqlWrhuPHj2Pfvn2YMGGCykIVRERUwDS1HCERUUmVsTT2tWvXMu2Ty+WCm5ub4ObmJshkMuHZs2fCwIEDBVtbW0EikQgODg5Chw4dBD8/P5V60dHRwrhx4wQHBwdBR0dHcHR0FAYNGiRERUUJv//+uwBAOHnyZLYxrV+/XgAg7Nu3T+1z+NCaNWsEDw8PQSqVChUrVhTWrVsnTJ8+Xcjqfx9r164VatSoIUilUsHMzExo1qyZcOLECeX+j5dbF4lEgrm5udCpUyfhxo0bWcb34fLjgpC+vHrFihUFiUQi2NjYCKNHjxZiY2PzdM4ZkpOThYkTJwp2dnaCnp6e0KhRI+HSpUuZljfPWG59165dWcb2cd9l9EtkZKRyGwBh7NixwubNm5X9WKNGDeH06dO5nu/H8QhC+vL3U6ZMEcqXLy9IJBLB1tZW6NGjh/Ds2TNlmfj4eOHrr78W7O3tBYlEInh4eAjz589XWfr+w9g+5uzsLAwaNCiHHiQioqyIBIFPqBIRERUGkUiEsWPHZjl9j4iIShc+Y0VERERERKQmPmNFRFSKJCQk5Li6HpD+vFJ+Fk0oKZKSkhAXF5djGXNzc+jo6BRRREREVJYwsSIiKkUWLFiQ66IIL168gIuLS9EEVIR27NiBIUOG5Fjm9OnT8PHxKZqAiIioTOEzVkREpcjz58/x/PnzHMs0btwYurq6RRRR0QkPD8f9+/dzLFOrVi2YmZkVUURERFSWMLEiIiIiIiJSExevICIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMT
KyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUhMTKyIiIiIiIjUxsSIiIiIiIlITEysiIiIiIiI1MbEiIiIiIiJSExMrIiIiIiIiNTGxIiIiIiIiUpO2pgMgIiIiIqK8UygUePnyJcLCwpCWlqbpcDRGT08Prq6usLKy0nQoAJhYERERERGVGMHBwfDz80NCQgJ0dCSQSnU0HZJGCIKApKRkyOUK2Nvbo0+fPjAyMtJoTCJBEASNRkBERERERLl6/fo11q5dC3s7S7RoURcODjYQiUSaDktjZDIZnj0LxZGj5yGV6mLEiJHQ0dFcoslnrIiIiIiISoBr165BX0+Kvn3bwtHRtkwnVQCgra2NChVc0K9fO0RFRePJkycajYeJFRERERFRCfD48WNUquwGiUSi6VCKFWsrc9jZWeHx48cajYOJFRERERFRMadQKJCYmAhLC1NNh1IsWVqY4t27dxqNgYkVEREREVExp1AoAABa4qxv34cM/QpVq/kUYUQF5+3bOIi17bB+w45PbkMs1oJcLi/AqPKPiRUREREREZGamFgREREREVG2kpKSNB1CicDEioiIiIiolDhy5CSqVvOBvoEL6tT9DJcv31Du27hpJ5o27QRLKy9YWFZEixbdcPXqLZX6M2cugLGJG65evYVGjTpA38AFy5evBwA8fPgE3XsMhaWVFwyNyqNGzZbYtn2vsm5ycjImTpwOR6fq0DdwQc1avtjrfzhTjH+t3gxXtzowNCqPVq16IjDwRZbnsn7DDlSv0QL6Bi5wKlcDU6fO1fh0v5wwsSIiIiIiKgXCwyMwbvwPmDhxDLZvXwWpjhRt2/VFREQUACA4KBT9B/TEju3/w+ZNy+BUzgE+zbviyZNnKu2kpqah/4Ax+Pzz7jh0cAtatWqGp0+fo1HjjggMfIFFi2bDf+8GDB7UGyEvw5T1+g8Yi//9tQmTJ43Bnt1r4eXliZ49h2P/gWPKMgcPnsCoUZPh49MIu/3WokWLxujdZ2Smc/njj5UYOXIiPmvlg33+G/Dt5LFYsnQNpk79tZB6T33amg6AiIiIiIjUFxMTix3b/4cWLRoDAJo1bQBnl1pYtGgV5syZgmnTvlGWVSgUaNWqGa5du4UNG3bgl19+VO5LS0vD7Nnfo3evzspt/QeMgY6OBOfO7oexsREAwNe3qXL/3bsPsHfvYSxfPg9fjBwIAGjTpgWCg0Iwe/bv6NSxNQBgzpxFaNK4HtauWQQAaN26OZKTU/DzL38o24qPT8CMmQswedIYZVytWjWDREeCSZNmYNKk0bCwMC/IrisQHLEiIiIiIioFTEyMlUlVxveWLZsop/s9fPgE3boPgZ19FUh0HCDVdcLjx8/w5OnzTG21b+er8v3UqfPo3r2DMqn62PnzVwAAPXt0VNneq1cn3Lp1D4mJ7yGXy3Hj5l106dJWpUz37h1Uvl+8eA0JCYno0aMjZDKZ8uPbsimSkpJx796jPPZI0eKIFRERERFRKWBlZZFpm42NFR49eor4+AS0adsXVlbmWLBgBpzLOUJXV4qRX0xCcnKKSh19fT0YGhqobIuOjoW9nU22x46NfQuJRAJzczOV7dY2VhAEIX1JdbEYMpkMVtaWmWL8UFR0DACgdp3PsjxWSOirbOPQJCZWRERERESl
QGRkdKZtb95Ews7WBpcuXUdo6Cvs37cR1apVVu6Pi3sHBwc7lToikShTOxYWZngV/ibbY5ubmyEtLQ2xsW9hZmaq3B7xJhIikQimpibQ1ZVCW1sbkf8+8/VhjCpt/Vvfz28NnBztMx2rfPly2cahSZwKSERERERUCsTFvcOpU+dVvp88eQ5169ZAUnIyAEBHR6Lcf/HiNQQFheSp7ZYtm2D37oOIj0/Icn+jRnUBALv8Dqhs9/M7gBo1vGFgoA+xWIyaNarA3/+ISpnduw+qfG/QoDb09fUQFhqO2rWrZ/oUx+erAI5YERERERGVCubmZhgx8htMnz4JpqYm+G3eUgiCgK++Sl91z9DQAOPH/4hvvx2HsFevMXPm/EyjVdn5adpEHDoUgKbNOmPSpDGws7XBw4dP8P59EiZPHouqVSuha9d2mDRpBpKTkuHp6YYtW3fj4qXr2LtnvbKdH378Cl27DsbQYRPQu1dn3Lx5F5u3+Kkcy9TUBDNnfIvvvv8ZoaHhaNasAcRiMZ6/CMb+/cfgt2s19PX1C6zfCgpHrIiIiIiISgE7O2ss/nMOfvttKXr3HonklGQcObwNNjZWsLGxwo7t/0NEZBS6dhuCxYv/worlv8HdzSVPbXt4uOL8uf1wdnbCuHE/oHOXgVi7bhvKOTsqy2zauBTDh32Oeb8tRdduQ3Dv3iPs3PkXOnb871mpTh1bY/nyeTh16hy6dR+KEyf+xratqzId75tvRmHNmj9w5swF9Ow1HL37jMTq1ZtRp3Z16OjoqN1XhUEkCIKg6SCIiIiIiCh7MpkMv/zyCzp3bo6qVTw1HU6xc+DAGURGJWD48OEai4EjVkRERERERGpiYkVEREREVFJwrlmWikO3MLEiIiIiIirmxGIxtLS0kJKSqulQiqWUlFSNP3vFxIqIiIiIqJgTiUSws7PDi6AwTYdS7MjlcgQHh8PBwUGjcTCxIiIiIiIqASpXroynT1/ixYtQTYdSrFy4eBtJScmoVKmSRuPgqoBERERERCWATCbD9u3bERQUBA+PcnAt7whdPSlEmg5MAxSCgIT493j8OAgvQ8Lh4+ODZs2aaTQmJlZERERERCWETCbDlStXcP/+fYSHh2s6HI3S0tKCq6srqlWrBm9vb02Hw8SKiIiIiKgkUigUSEtL09jxe/fujR07dmjk2CKRCBKJBCJR8Rmv09Z0AERERERElH9aWlqQSqUaO75CodDo8YsbLl5BRERERESkJiZWREREREREauJUQCIiIiKiMuTt27cIDAxEYmIi1Fluwc7ODmfOnCm4wDRALBbDwsIC7u7uar9gmItXEBERERGVAampqdi1axcCAwPTn8/S04OW6NMnsMnkcmiLxQUYYdGTyWRISU6CRCJB8+bN0aBBg09uiyNWRERERESlnCAI2Lp1K16Fh6NZuy5w8agIHS48AQB4FxuDf65fxvHjxyGRSFC7du1PaofPWBERERERlXKvXr1CcHAwmnfoBk/vakyqPmBsZo6Gvm3hWrEyLl669MnTI5lYERERERGVco8ePYKunj6cXD00HUqxJBKJ4OldDbExMYiKivqkNphYERERERGVcvHx8TAxM4eWFm//s2NqYQUAePfu3SfVZ88SEREREZVyCoUCWtksNHF0vz/Wr1pexBFlLSQ4CA76Ihzc65evehfPnoGDvgh3blxXbpv1wyQ0r1UZntZGqGBjjHaN62Dfru3ZtiH+t38UCsUnxc7FK4iIiIiIyrCjB/1x9+Z1DP5ijKZDgbWtHfafuQRXd8981atSvSb2n7kEj4peym2JCQnoN2QE3D0rQiQS4dBeP4wZ1BcKhQJde/cr6NCZWBERERERUcFISkqCnp7eJ9eXSqWoVbd+vusZGRtnqjdvyUqV7z6tWuPJowfYuXl9oSRWnApIRERERFRGTRg5GLs2b8DjB/fhoC+Cg74IE0YOBgBcv3IJPdu2gLulASrammDs4H6IiohQ1s2Ytrdj03pM
HjMClR0t0KFpXQCAg74Iy36fh1+nT0FVZ2t42Zni5ynfQhAEnDt9Eq3qVYeHlSF6tWuJsNCQTG1+OBWwXkUXTPl6HNavXIa6FZxR0dYEQ3t1QXRkpLJMVlMBs2JmboG01NSC6LpMOGJFRERERFRGTfh+GqIjI/HsySMsWbcFAGBhaZWeVLX2QYvW7bBi4w68f5+I32ZOxZBenXHgzCWVNn796Qe0bNMey9dvU3k+ad3KpWjQ1AeLV2/CrWtXsODn6ZDL5Th36gTGfzsFOhIdTJv0JSaNHoZtB47nGOfxQ/vx4tlT/PLHMsRER2Hmd19j6sTxWLEx+2emgPT3d8nlciQmJODE4QM4e/I4Fq/d/Im9lTMmVkREREREZZSLqxssrKwQFhKsMpVu4qihqFqzNlZv3wORSAQA8KpcBS1qe+Pk0cNo2aadsmzlqtWxYMXqTG3b2tljyZpNANKn4R0/tB9/LfkDp2/cVz4L9fpVGKZOHI+4t29hYmqabZyCIGDdrv2Q/vv+rdDgICyZPyd9UY4cVjo8d/ok+nZoBQDQ1tbGzwuXokPXHnnsnfzhVEAiIiIiIlJKev8e1y5dQIduPSGXyyGTySCTyeDq4Ql7RyfcuXFNpXzLNu2zbKdJi1Yq3109PGFjZ6+ywISrR/oiFeFhoTnG1KBJM2VSBQAeXpWQlpamMjUxKzXr1MPhc9ew/VAAho+bgGkTx2Pb+jU51vlUHLEiIiIiIiKlt7GxkMvlmPHt15jx7deZ9r/64JkoALC0scmynY9HoCQ6OlluA4CU5OQcYzI2Ua2nI/m3XkrO9QyNjFCtVm0AQJPmLSGTyTDz+2/Qa8Bg5fLqBYWJFRERERERKZmYmkIkEmH85B/RpmOXTPvNLS1VvmdMFSwJqtaohdVLFyE6MhLWtrYF2jYTKyIiIiKiMkxHoqMyYqRvYIBa9Rog8PFDVJvxswYjK3hXL56HkbFxpuSwIDCxIiIiIiIqw9wremH7xrXw37kN5d08YG5pialz5qN32xYYNaA3OvfoAxMzM4SHheLsqRPoPWAIGjb10XTYOXrwz13MmfodOnTrCUdnF7xPSEDAkYPYun41fpg1F9raBZ8GMbEiIiIiIirD+g4ahtvXr2LqxPGIjY5Gz/6DsOh/67E34DwW/Dwd34wagtTUVNg5OKKxT0u4uLprOuRcWVnbwNjUFH/MnYXIN69hZGICd8+KWLN9L1p37FwoxxQJgiAUSstERERERFQs7NmzB1Fv36FD38GaDqXYep8Qjy3LF6Jfv37w8PDId30ut05ERERERKQmJlZERERERGUB56nlSN15fEysiIiIiIhKOYlEgtTUnN/5VNal/vtOLJ1/362VX0ysiIiIiIhKOXt7e8RERiApMVHToRRbYcHPoaWlBWtr60+qz8SKiIiIiKiU8/LyAgBcPRsArl2XWUL8O9y7fhmurm7Q09P7pDa4KiARERERURlw584d+Pv7w8LaFuUreMHIxAxaWmV3nEUAIJelITI8DM8fPYCOjgSDBw2CmZnZJ7XHxIqIiIiIqIx49uwZbty4icDAp0hLS9N0OMWCsYkJKnl5oX79+jAxMfnkdphYERERERGVMYIgIC0tTa1pgb1798aOHTvUiqMg2lCHWCyGtrZ2gbRVMK0QEREREVGJIRKJPnn1uwwKhQJSqVTjbRQXZXdSJRERERERUQHJdcRKEAQkJSVBLpcXRTyUDZFIBF1d3QIbqiQiIiKi4k0mkyE5ObnYruInkUgQHx+v8TYKm5aWFvT09HJd6CPbZ6xiYmJw4cIFPHr0EO/fJxVKkJQ/IpEIruXLo1bt2solM4mIiIiodHn06BGuX7+OFy9eQKFQaDocQvpLgz09PdGgQQPY29tnWSbLxCo6OhobNqwHBAFVKrjBwc4aEm1xYcdLOVAoBLx9F48HT17g5avXaNu2LerWravpsIiIiIioAN24cQMHDx6Eg509KrpXgKmxcZleEr04kMnliIyO
woPHDxGfmIDPP/8czs7OmcplmVht2bwZMdGRGNKrIwwN9IskYMobQRBw9MwlXL19H9988w2MjIw0HRIRERERFYDExET8/vvvqFa5Clr7+EIkEmk6JPpAWloaduzbjcTk9xg/fnymf59M6W9SUhKev3iBejW8mVQVQyKRCD4NakJLSwsPHz7UdDhEREREVEAePXoEAGhSrxGTqmJIIpGgYZ36iI2NxevXrzPtz5RYRUREQKFQwMXRrkgCpPzT09WFrZVFlv+gRERERFQyhYeHw9LCEgb6HNworpwdnQAgb4lVxhuYpVL11rX/FC5V6mHcpCmF0rb/waNY/tf6fNebMfd3XLxyreAD+sDg0RMgMnHI9DkacDrbOlIdCd+WTURERFSKyGQySNV8t9Sn8qxeGRO+nVgobe8/dACr1vyV73qz583BpauXCyGi/yxcsgj1fBrBprwjzJ1sUKtxPaz4a1W2KzGKxWKIxWKkpqZm2pft2t2aGHzcu3kNzExNCqVt/0NHcf3WXYwZMThf9Wb+uhCGBgZoWK9OocSVwdXFGVtWL1HZ5uXpkUMNDg8TERERUcHYuXErTE1NC6Xt/YcP4ubtW/hi2Ih81fvlt7kwNDBAg7r1CyUuAHgbF4ceXbqhslclSKW6OH32DL75YTLexb/Dd99MzrKOKJv78GLxUqSkpCTo6emhRjVvTYeiMXp6uqhfp5amwyAiIiKiMiTjPrx61WqaDkUjZk2drvK9pU9zhISFYtP2LdkmVtnJ89qN67fsgLZ5ObyJiFTZHhMTCx1LF6xauwmXrl5Hpz6DYV+hJgzs3FG9cSts2u6nUv7MuYsQmTjg0LEA9BgwAsaOFdBz0BcAMk8FzE97J06dRb9hY2Hk4Aln77r4bdFyZZnBoydgw9ZduP/wsXKa3eDRE3I9Z5GJAwBg8rTZynpnzl0EAPy+ZCXq+LSDiVNFWLtVRYdeA/Ek8FmmNlat3QRn77rQt3VDq859cOvOPYhMHLB+y45cj09EREREtHHrZhhYm+JNRITK9pjYGBjZmuOv9Wtx+doVdP+8F8pX8oC5kw3qNmuILTu2qZT/+/w56FoY4cjxo+g7uD+snO3Rb+gAAJmnAuanvYDTpzBw5FBYlrODR7VK+H3xH8oyw8d+gc3bt+LBo4fQtTCCroURho/9Itdz1rVIX/n6h+lTlfX+Pn8OALBo2WI0atkM1i4OcKpQHl379sDTwKeZ2vhr/Vp4VKsEM0drtOvWCbfv3oGuhRE2bt2c47EtzMyRmpr/R27yPGLVtUNbjPr6B+zyP4hxI4cot+/efxgA0LNLBxw/dRaN6tXBqKEDoCuV4sKVaxg2bhIUCgUG9eul0t7Ir75D/17dsHfYQIjFWb8jK/hlWJ7bG/X19xjQpzv2bl4N/0PH8N30X1DV2wttfJtj2uQJiIyKxqMnz5TT7awsLHI950sB+9HAtxPGfzEU/Xp2AQBUquAJAAh9FY5xIwfD2ckR7+ITsHLtJjRs1RlPbpyDubkZAGD/4eMY9fX3GD6wH3p0bo/b/9xHr8FZX0iBz4Ng4lQRSUnJqFKpIqZ9OwFdOrTJNUYiIiIiKt06d+iI8ZMmYM++vRg94r97yb0H9gEAunfugoDTp9Cgbn2MGDwMUqkuLl29jFFfjYVCocCAvp+rtDf26y/Rt2dv7Ni4Ndv78JchIXlub/ykCejXqw92bNyKA4cPYsrMn1Clsjc+a9kKP0z6DlHRUXj89CnWr1wNALC0tMz1nP8+ehLN2rTEmBGj0Lt7TwCAV4WKAICwV68wavhIlHMqh/j4d/hr/Vr4tPXFP1dvwdzMHABw8MghjJ/4FYYMGIRunbrgzj938fmwQdkeTyaTISkpCecvXcCWHdsw5dvvc43xY3lOrExMjNGuVQts8/NXSay2+fnjsxZNYW5uhj49Oiu3C4KApo3qIzQsHKvWbc6UCHVq2wrzZuW8UEV+2uveqR1m/JCeZbf0aYJDx07Cz/8Q2vg2
h5urC6wsLRAcEpav6XYZZcs5OmSq98fcmcqf5XI5WjVvAmv3avDbdwgjh/QHAPw8/0+0aNoIfy2ZDwBo7euDNFkapv08X6WtGlW9UadmdVSu6Im3ce+wYs1GdP18GHZtWIUeXTrkOV4iIiIiKn1MjE3Qxvcz7NizSyWx2rnbD77NW8DczBy9uvVQbhcEAU0aNkLYqzCs2bA2UyLUvk07/DJjdo7HzE97XTt2wrTvfgQAtGjmgyMnjmHPfn981rIV3Mq7wtLCEi9DQlCvTt08n3NGWSdHx0z15v/yq/JnuVyOlj4t4FTBFXv2+2P4oKEAgF9/nw+fJs2wYtFSAECrFr5IS5Nh5tzM5/3s+TNUrlNd+f37id/iy9Hj8hxrhnw9Y9W3R2f0HjIaL0PCUM7JAeGv3+DvC5excdWfAIDY2LeYPvd37Dt8DGGvXkMulwMALP4dwflQ+9Ytcz1eftr7rEVT5c8ikQheFTwQ+io8P6eXL5ev3cC0n+fj5p1/EBP7Vrn9SeBzAOn/yLfu3sOCn6ep1OvcrnWmxOqr0cNVvndq9xkatuqEn+YsYGJFREREROjVvSf6DxuEl6EhKOfohPDXr3Hu4nmsWf4/AEDs21jM/nUODhw5hFfhrz64bzbP1Fbbz1rnerz8tNfS57/7epFIhIqeFRD26tUnnWdeXLl2FTPn/ozbd28jJjZWuT3wWSCA9Pvw2//cwa+zflGp17Fd+ywTK0cHR1wI+BsJiYm4cPkiFvy5EFpaWvjp+/ytVp7nZ6wAoEMbXxgY6GP77vRhx517D0BXV4ou7dOnrA0e8zW2+flj0vhROL53K66dPoyh/fsgOSUlU1s2Vla5Hi8/7ZmaqK4mqKMjybJcQXgZEobPuvaDXC7HqkXzcOG4P66dPgxrK0vlMSOjoiGTyWBlqTrl0Noq96FPLS0tdO/UHg8fP0VSUlKhnAMRERERlRztPmsDA30D7NqTvt7A7n17oKuri07t0v8IP2LsKOzcswtfj/0SB/324ULA3xj0+YAs74etraxzPV5+2st0Hy6RIDkl+VNOM1cvQ0PQoUcXyOVyLF24GKePnMCFgL9hbWWF5OSM+/AoyGQyWFqo3ndbWWadf0ilUtSqURPNGjfBj5O+w6yp0zFv4Xy8fvMmX7Hla8RKT08PXdq3xvbd+/DthDHYvnsfOrZpBQMDfSQnJ+Pg0QAsnDMd478YqqyjWK3Isq3c3iad3/aK0tGA00hISMSezath+u/y8DKZTGXkysrSAtra2oiMilapGxEZVZShEhEREVEpoKenh47tOmDX3t2Y+OXX2LXHD+1bt4WBgQGSk5Nx+PhR/PbzXIwZOUpZR7Hm0+/D89NeUTp+8gQSEhOwY+MWmJqYAsi4D/9v5MrK0hLa2tqIila9746MUl2ELzs1qlWHXC5H8Mtg2NrY5Dm2fI1YAUDfHl1w6+49HAs4g8vXbqJvjy4AgJSUVCgUCuhIJMqy8fEJ2H/4eH4PUSjt6Uh0lFlsfkiyyLiTkpMhEokg+SC2nXsPQCaTKb+LxWLUqOqNfYeOqdT1P3Q012MqFArs8j+Iyl4VoKenl++YiYiIiKj06d29B27fvYMTpwJw5fo15XNQKakp/943//dy4fj4eBw6eviTjlPQ7eno6HzSTDKJRJLp/j054z5c+7/7cD//PZnuw6tXqYYDhw+p1N1/+GCejnvx8iWIRCK4OLvkK958v8eqVfOmsDA3w9BxE2FqYoK2rZoDSF/cok7N6vh10TLlaM2vfyyFibExIqLyP0pT0O15VXDH2s3bsc3PHx6u5WFpYQ4XZ6c81dt36DiaNKgHA319VPBwQ4umjQAAQ8Z8jS+G9Mf9R0/w+9JVmYZBp07+Cp37DsGI8ZPRs0sH3Lp7Dxu2pg/famml57TBL0MxaPQE9O3eGe6uLoh9G4cVazbi+q072L0p/2+oJiIiIqLSqaVPC1iYm+OL8WNgamKK1r6fAUhf3KJ2jVqY/+dCWFpa
QlssxoI/F8LY2CTPozQfKuj2KnpWwIYtm7Bj9y64u7rBwsICLuWc81Tv4JFDaNSgIQz09eHp7gGfJs0AACPHj8awQUPx8NFDLFq+RDl6leH7iZPRo38fjJ4wDt06d8Wdu3eweftWAP/dh8e9i0Pn3t3Rr2cfuJZ3hUyWhrMXzmHpqhUYPmgobKxznzL5oXyPWEkkEvTo3B6vwl+je6d20NH5L5Pdunop3Mu7YNDoCfjyu2no0bkDBvbtkUNrOSvI9oYN6IueXTpg/OSpqNO8HWb8+nue6i1bMAcKhQJte/RHnebtcOP2XVSp7IX1K/7Ajdv/oEPvwdjm5w+/Df+DiYmRSt1O7T7DioVzcezUGXTuNxRHAk5jxcK5AAATY2MAgJGhAUyMjfDzgj/RrudADBn7DRSCAkf8NqNrx7afdK5EREREVPpIJBJ07dQFr16Ho0vHTir34Rv+twZu5V0xfOwX+OaHb9G1Uxd83rvvJx+rINsb/PlAdOvcFd98PwmNfJvh53lz8lRv0W+/QyEo0Ll3NzTybYabd27Du1Jl/LV0JW7euY1u/Xpi555d2LZuE4z/vbfO0KFteyxZsAgnTp1Ez/59cOzkCSyen/5+rYz7cF2pLjzc3PHniqXoOaAPho4egXMXLmDJgkVY9FvecoUPiQRBED7cEBgYiC1btuDr4X1hbGSY7wYpZ2s2bsPw8ZPw4u7lPI2YZWej32EYmJihe/fuBRgdEREREWmKv78/oiOj0L97H02HUiqt27wBo78ah0e37uVpxCw7C5b/Cd9WvqhXr57K9nxPBaS8i4mJxcx5f6BF00YwMjTAtZt38Mvvi9G5fWu1kioiIiIiIspeTGwMfvntV/g0aQpDQyPcuHUD8xYuQMe27dVKqnJSphOrDx9y+5hIJMr2TdR5JZFI8OxFELbu2ou3ce9gZWmBAb27Y97M/K2JT0RERERUmhT6fbi2BM+DnmPH7p14GxcHKwtL9OvVB79Mn6VWuzkps4lVUHAIyletn+3+Zo0b4MwhP7WOYWRkiIM7N6rVBhERERFRaRL0MhgVa3hnu79Jo8Y4sf+IWscwMjLC3m3q3cvnV6bEKmNde4VCyFS4NLG3s8G109kvGWlkZFCE0eSfQlDk+g4CIiIiIio5RCIRBIXm3xVV2Oxt7XAh4O9s9xsZFt91HgRBgEJQKFcW/FCmxMrAID2hePsuHqYfrXJXmujo6KB2zWqaDuOTCIKAt+8SYOtQTtOhEBEREVEB0dfXR1x8PARBKNV/QNfR0UGtGjU1HcYnSUhMhEKhgL6+fqZ9mVItGxsbGBkZ4f6T50USHOXfqzeRiHsXDw8PD02HQkREREQFxMPDAwmJCQgND9N0KJSNR4GPoaWlBVdX10z7MiVWIpEINWvWxPW7D3H55j2k5fBgGRUtQRAQ9joSuw6dhLm5GVxcXDQdEhEREREVkHLlysHS0hL7jx1G2OtX+OitSKRBCoUCj54+xt8Xz6NSpUrQ09PLVCbTe6yA9Bv4w4cP4/r165BItGFrZQFtsTZK8oikQpH1XMiSQqFQIPZdAuLexcPc3AwDBw6CiYmJpsMiIiIiogIUHx+PDRs2IDo6GsZGRjAxNoG4mN7DKhQCtLRyTxByKpfXNjRJJpcjOiYGSclJcHd3R+/evaGtnXkNwCwTqwzR0dF48OABoqOjc1wSsSQ4d+4cmjRpoukwPpmWlhb09fXh4eEBFxcXtZegJCIiIqLiSaFQICgoCE+ePMH79++hKKYLWuT1/jqnciXhHl0sFsPExAReXl6wtbXN9vm3HBOr0qRTp07Yv3+/psMgIiIiIioV8np/nVO50nSPXjzHFYmIiIiIiEoQJlZERERERERqyvzUFRERERERqe3t27cIDAxEYmJiqVrhT0tLC6ampnzm/yNMrIiIiIiIClBKSgp2+e3Cs8Bn0NLSglRPr0SvTv0xuVyG5PdJqFa9Gk6cOAFfX99S/ULjvCoTidX79+8hl8s1
HQYRERERlXKCIGDr1q14/eYNfLp0gHNFD+hIpZoOq8AlvovHo5u3cfHvCxCLxWjRokWW5WJiYkrVaF1OykRi1aB+PQQHB2s6DCIiIiIq5cLCwvDy5Uu07tsD5TzdNR1OoTEwNkItnyaQpclw9epVNG3aNNO7neLj42FhYQEPDw8NRVm0Ss+YZA7u/nMPce/iNR0GEREREZVyjx49gp6BPhzdXTUdSpHwqOaNlJQUBAUFZdqXlJQEAHj9+nURR6UZZSKxIiIiIiIqCvHx8TCxMC9Vz1TlxNTSAgDw7t27bMuUlUdyysa/OBERERFREVAoFNAqQ6vlaWlpQSQSQaFQZFuGiRUREREREZGamFgRERERERF9IOO5qfyQyWSFEEnxU6YSq7Lyj0pERERExc/j+w8woF1nVLawh5u+GZpUqILlv/0OAJgweDhaeNfEqSPH0MK7Jlx1TdCmVgPcuHxFpY1dGzejS+PmqGxuh0pmtujh0wq3rl7LdKynDx9heLfeqGxuBzd9M/hWqwP/bTuU+wVBwMoFf6CxpzfKS43RwLUi/vfHYpU2fp8xGx6GFrh19Ro6NmgGV10TbFi2shB6pnQoE8utZ3j//j2MjY01HQYRERERlUGDO3aDpY0Nfl+zEkYmxggKfIbw0DDl/jfhr/HjmC/xzYypMDUzw9JfF+Dz1h1x/uk9WFpbAwBCg4LRY+DncHZzRVpqGvy37UD3pr44cfc63DzTlzV//jQQnRo0g72TI2Yt/h1WtjZ4fO8Bwl6GKI/101cTsXX1Onw55TvUqFcHNy5expzvpkBXTw8DR41QlktLTcW4foMw4usv8f2cmTCzsCii3ip5ylRilZiYyMSKiIiIiIpcTFQUXr4Iwsw/f8dnHdsDABo191Ep8zYmBqt2bUHjFs0BAPWbNUEdJ3f89cdi/DD3ZwDA1z9NUZZXKBRo2qolbl+9jp3rN+KHObMBAAtnzIZERwf+F07D6N9736a+LZX1gp49w7qlK/DryiXoP3K4cn/S+yT8MfMX9B85TLmqYVpaGr79ZSY69+5Z8J1SypSpqYCJiYmaDoGIiIiIyiAzCws4OpfDrz9Mw84Nm/AqNDRTGWMTE2VSlfG9iW8L3Lzy31S/pw8fYVjXXqhmUw5OYn04Swzx7PETPH8SqCxz/uQZtO/RVZlUfexcwCkAQLvuXSGTyZSfxr7NEfH6NV6FhKiU923fVq1zLyvKVGKVlpam6RCIiIiIqAwSiUTYevwQ3L0qYsrYCajj5I62tRvi8tlzyjLmVpaZ6lnaWCMiPP0Fuwnx8ej7WXuEBr/E9IXzsPfcSRy+dgGVqlVFSnKysk5sdDRs7O2yjSUmKhqCIKCKpQOcJYbKT99W6SNpr0L+S/r09PVhYGio9vmXBWVqKqC+vr6mQyAiIiKiMsrN0wP/27UVaWlpuH7xEn798ScM7tgdN8KeAwBiIqMy1Yl6EwFrO1sAwI1LlxEeGoYNB/eicrWqyjLxcXGwc3RQfjezsMCbV+HZxmFqbg6RSIS9509BR0cnc5wVPJU/i0Si/J9oGVWmRqwMDAw0HQIRERERlXESiQQNmjXF2O8nI/7dO7z+Nwl6FxeH86dOK8u9i4vDuYBTqFmvDgAgOSl9VOrDZOjaxUsICQpWab+Jb3Mc8tuLhPj4LI/fuGX6dMPY6BhUq10r08fQyKjgTrYM4YgVEREREVEhe3D3H8ya+B069e4BZzdXxMe9w9K58+Hk4gwXN1cA6SNJk4aNwsSZ02Biaoqlvy6AIAgYPmE8AKBm/bowMDTEj2O/wrjvJ+F12CssmD4btg4OKsf6evpUBBw8gi6NW2DMt9/A2s4WTx88QtL79xjz7US4eXpg8NhR+GrAUIya/DVq1KsDWZoMz588xcXTf2Ot/64i75/SoEwlVnp6epoOgYiIiIjKIGtbG1jZ2mDp3Pl4HfYKRiYmqNukERZvXgexWAwAsLGzxY/zfsHPk39A8LPn8KxcCVuOHYCV
jQ0AwMrGBqt2bcXsSd9jaOeeKO/pgXmrlmH5vAUqx3L1cMe+i2cw94dp+HHMV5DJZHD19MDY7ycpy8xevBBuFTyxedVqLJo1B/qGhnCr4IkOPbsVXaeUMiJBEARNB1HYMuaGloFTJSIiIiIN2r17N2IS49F+YN981ZsweDjuXr+JU/duFlJkhWf1rHlo164dateurbI9IiICNjY20NHRQUpKSpZ1O3XqhP379xdFmIWuTD1jRURERERERausrHPAxIqIiIiIqCCVoVlSeZkRVlYSqzL1jBURERERUWGSSCRIzWbaW04WrV9dCNEUPllaGgRByHLZ9gxlZQE5jlgRERERERUQe3t7RL+OQFLie02HUiTCngcBSD/v7FhbWxdRNJpVJhKrZcuWYdWqVZoOg4iIiIhKOS8vLwDAtZNnSv3CaSnJybh19gKsra1haWmZab+FhQVGjhyJhQsXaiC6olcmVgUkIiIiIioqt2/fxr59+2BpawOXShVgZGoKLa3SM54hl8kQ/SYCT+/eA+QKDBw4EHZ2dp/UVmlaFZDPWBERERERFaDq1avD0NAQN2/exJ1zl5CWlqbpkAqcgYEBQoKCMX36dFhZWWk6nGKBiRURERERUQFzd3eHu7s7BEFA2r8LPBS23r17Y8eOHYV+HLFYDG1tbXTq1IlJ1QeYWBERERERFRKRSJTjinkFSaFQQCqVFsmxKLPSM9mTiIiIiIhIQ5hYERERERERqYmJFRERERERkZqYWBEREREREampWC9eERYWhqSkJE2HUaxJpVI4OTnlqawgCAgODoZMJivkqIiIiIjKDmdnZ0gkkjyVjYiIwLt37woljsTERAQGBuaprL6+Puzt7fNUVqFQ4MWLF8qVDc3NzWFubp5l2fycn7GxcZ7KlRhCMQaAnzx8hgwZIqSkpOTan1u3btV4rPzwww8//PDDDz+l7eNdtZoQHByc671YdHS0oK2trfF4Mz7fffedIJfLc417wYIFKvV0daXCtWvXhI4dO6qUi4yMFMRicZ6Pr6WlJVSqVElQKBS5xlASFOsRKwBYPgqwNtF0FMXXo1Bg1uYNeBb4BLv3+MPS0jLbsrGxsdASi9F09q4ijJCIiIio9JIlJeDOX1NRq3Yd7N/njwYNGmRbNiEhIX3mUK+mgIttEUaZhcBXmPfbb3jw8CG2btkCQ0PDbIvGxsbC2koPa5ZVBABMn/MCvXv3gKenl0q5hIQEyOVyoOkQwMY91xAUz6/jwY29GDZsGFauXFlky9IXlmKfWNV2AxyzzxXKvIYVgSrOCgxbehl169TEgYNHULly5WzLi0RasK7aqAgjJCIiIirdzCvWwuU5Q9HMxwdr16xB//79c67gZAVUzNujHIWmohPgaIlDfx1B/YYNcfjgQZQrVy7b4nq6YjRpaAYAWLtMF03b3EBKcioEQYBIJFItbO0KOFXJPQanKoCVMzZsWoYnTwPhv3dPjoMExV2pWrziyE1g/SlNR5EuJAqwHwIcvJa/ehcfpde78+K/bd1/Td/28edpePr+Oh7A4Wly6MpfoUH9ujh06FDBnQgRERER5UjXxBJNft4Fx6bdMGDAAPzwww9QKBSaDit31d2g+KEPHr0OQY3atXDp0qU8VXNx1sPi+Z4IexWO//3vf+rFUKkFFD1+xuXb91Czdh3cv39fvfY0qFQlVsduAhuKSWJlbQIcmAo08sq97IeqOKfX8/joWcI6HunbP/w4fZDQO1oC+36Qo4FHEjp27IiFCxcqHzAkIiIiosIllkhR+6tFqDZ0BubNm4eu3bojISFB02HlztES8il98dZcD019mmHTpk15qtalgzWGDXTAV199ibt376oXg70X5H3m41WSCHXr1y+xgwSlKrEqSEmp6tWXSoBaboBZ9tNVs2Skl15PX6q63UQ/ffuHH92PFp8x1APWjhMwpq2AiRMnYtiwYUhNVfNEiIiIiChPRCIRKnQbjUY/bcLREwFo2KgxXr58qemw
cmesD8XE7pDVrYCBAwfmecTtl5/c4eGmi169CiCJNLaGvNevSLKtXGIHCUpNYjVhNbDzAvA47L+pchNWp++7Hgj0nAe4fQFUGAOMWQlEfbAKZMa0vR3ngUnrgMrjgPaz0vfZDwGWHgJ+3Q1U+RKoOAaYvRMQBODcA8D3J8B9FNDrNyAsOnObH04FrDsJ+HETsO4kUGdSeixDFgPRH8SS1VTA/NDSAqb0BP4cDmzZvAGtfFsgKirq0xojIioEz45sxMEhtbC7uwv+ntoTsc/+wc4ONngRsF1Z5kXAdhwb5wO/ruVwYGA1/LNxDhRyucr+nR1sEPvsH5yd3he7u7vg8Ij6CDq5M9PxXl07gYBv2mB3N2fs61cJN5Z9C1lyYpGcKxGVTfZ1WqH5bwcRFBGL2nXq5nmKnUZpi4EhnwG9muHXefPQtVu3XJMlXV0x1q3wQmhIMMaOHaN+DDp6EDr+AKF2txI5SFB6EqtOQMuqgLPVf1PlJnRKT6p6zAOM9IGVo4HfBqUnLUMWZ25jrl96wrRsFDCt93/b151MT5qWjABGtgZWHAFm7QBmbAPGt0/f/uw1MHFd7nEevw0cvwXM6Q/M6gdcfgxM2ZJ7vUuP0xPD8iOAbr+m18tJz0bArskKPLiTvqhFSZ6vSkSlR9iVo7ixbDJsajRDox/Xwbp6U1z6dYRKmcd7V+L64m9gW7M5Gv+0CRV6jMPTA6txb9OcTO1dWTAmva2pG2DqVgVXF32JdyFPlPtDzh/AhdkDYeLihYZT1qHqkGkIvXQI1/78utDPlYjKNhMXLzT//QhEVi7wad4cW7bk4YZP00QioE1tYHxnHDx+FPUbNsx1xM3DzQAL53pg48ZN2L17dwHEoAU0GQS0mYANmzajeYuWiIyMVL/dIlDsVwXMKxdrwMIICI1OnyaX4Zu1QFUXYM249GsFALwcgebTgJN3gJbV/itbuRzw+9DMbduaAUtGpv/sUyU9OfrfceDMz/89CxUeC0zdAsS9T5+2lx1BANZ/lT5VEEgf2VpyEFAo0kebslK/AtCjIeBqC7yOBVYeBXrPB3Z/D9TOYSXLjEUtBi1OX9Sid59+2RcmIioCD7f/AeuqjVHny4UAANtazSHI0nBv8zwAQNr7BNzf+hsqdB+LqoOmpJep0Qxa2jq4s2Y6KnQbC6nxfy+ldO8wFO7thwAALL1qI/zaCYReOIhKfb6BIAi4s3YmnJp0Rp0v/1DW0TW3wbkZ/VCpzzcwca5YVKdORGVQxqIWN5d9i/79+2PkyJGaDilvMha1WLIPNWrXQtvPWiOnSXm9u9vi7MVYTJs2peBiqNQCClM7XDkwFzVq1cYev12oW7duwbVfCErNiFVW3qcA154CHesAcgUgk6d/XG0Be3Pg9kfT7VpWzbqdppVUv7vaALamqgtMuP77KoLwmJxjalDhv6QKADztgTQ5EBWffZ3JXYG+TYF6nkDneukJlY0psGh/zscC/lvUop7be6xZvRpyuSz3SkREhUAhlyP2+T3Y12utst2+fhvlz9EPr0GWlAinxp2gkMuUH5vqTSFPSUJc8COVujY1fJQ/a+sawMDaCe+j0pdMjQ97hvcRIZnasvJuAJFIC7GBdwrtXImIMmQsalF16HT1V9ArSv8uahFjLMGWLVuQlpbzlLzfZnvC3lYLYjEAeVrBxGDvBXnfBQhLFFCvfvbvBysuSs2IVVbi3qcnVNO3pX8+9uqjJMgqmxcRG380AqWjnfU2AEjJ5Tr6uJ4kj/U+pC9NH2k7dD1v5Q31gEldgLP3AZmiZD0ESESlR8q7aAhyGaQmFirbdU0sVcoAwImvfLNsIynqlcp3HQNjle9a2hIo0lIAAKnv0n/JX/hlSJZtvY8My0f0RESfTiQSwbV1f7w4vhXxoU81HU7eGesDnRsCC3Of4megL0bn9tZYuDQYeP+2AGOwBup2B44uKrg2C0mpTqxM9NOn/33ZHmhTM/N+cyPV76LMRUqFw9eB8au1
YGNri7DwCE2HQ0RllNTYAiKxNlLiolW2J8f9t8COjlH6yycb/rgO+lYfvXcCgIFN9i+v/JiOkSkAoMaoubCokPl/Anrmtnlui4hIHfGvXuDS7AFQxL3WdCj5c+YuRFtOwbFcOYiEnKdlXbkehz9X/Ps8lpFVwRxfEICru4ALm9GlS9eCabMQlarESqKtOvKjL01/3uppOPBdec3FVdDepwABd4DquZyTIACLDwLz9gA9e3RFw0aNMWnyt0UTJBHRR7TEYpi5euPVlaPw7PzfcwZhl44qf7aoWBtiqR6Sol/BsWE7tY5n5OgBPUt7JL4OhkeHLB6gJSIqAhF3z+Py3GFwsLXG6v370bx5c02HlDu5AthxBgi4hdFjxsDY2BjbtizNtnhsbBqGjXmIGjWq4/r1WwUTQ1oKcGIp8OhvTJ8+HdOnTy+YdgtRqUqsPOyA7eeAvZfTn4MyNwKm9UpfCv2L5UCXeoCJQfpzUGfvA72bAA2L+XPLV54Ay48AbWumvxD4zdv0xSsi44D/5bCqZVJq+tLxey9DeTGuWLGiyOImIsqKV5+vcWH2IFxb/A2cGndC7PN/EHxqBwBAJNKCjqEJvD//DnfXzcb7qHBYV2kIkZYYCa+D8erKUTT8YQ20dXNYIegDIpEI1YfPxOX5oyFLfg/7Or4Q6+rjfUQowq8FoMqgH2Hk4JZ7Q0REn+jZkY24tfIH+Pj4wG/XTsTH5/BQfXHxPhmiVYchevASS5Ytw5gxYzB16tRsiwuCgDETHyExSYJFi5agcePG6seQGAvxgTkQRwdj044d6NWrl/ptFoFSlVj1bQrcepG+Ol9sAtCrEbBoOOD/I7DAH/h6DZAqB+zNgMaV0lcSLO6sTYA0Wfp7tGIT0kfharsD8wYBNVyzrhMRBwxdIsaDUDF27NhUYi5GIir9HOq1Qc0xv+HRrj/x8sxumHvWRM0xv+HstF6QGKTPz67QbTT0LGzxxH8lAg+ugZZYGwZ2LrCv0wpaEp18Hc+pcSdIDEzwcMciXD7jBwDQt3aCba0WkJoW0FQVIqKPKOQy3Fk9HU8PrMaYMWOwaNEiSCSS4p9YvYmFeMl+6CemYs/Ro/D1zfp51w+tXBOKI8cjsW/fPjg4OKgfQ8RziA/8Ags9bRw6fw61a9dWv80iIhKK8SuNRSIRrs5PX9mO8uafYGDIEjEEiTn27T+EOnXqKPctX74cX341Ad39QzUYIRGRqufHt+D64m/Qfs21fD1DRURUHKUmxOHK/C8QcfsslixZgtGjRyv3vXz5Es7OzsDE7kBlF80FmZWHLyFeeQjOtvY4cvAQPD09lbumTp2KzRv/xJ1Lqsud37rzDp91uYnRo8fhzz//RFBQEMqXLw/0mA2Uq/bxEXL39CK0ji1ClcqVcOjA/oJJ1IpQqRqxKusOXwe+XK0Fr8re2Lf/UIm7GImo9EuJj8WDrQtgXa0xtPUMEfP0Nh7uWAT7+m2YVBFRiadcpOJdJI7mccSnWPh3kYqmPs2we5cfzMzMcq0S906GIWMeolq1apg/f756x/9gkYpuPXtiw/r10NfP27Tv4oSJVSnw8SIV6zdsLJEXIxGVflpiCRJeB+Hl33uRmhgHqYkFnFv0RNXB0zQdGhGRWj5cpOJwwBWVEZ9i66NFKjKmLOZGEARM+O4xYmJFCDi5Czo6+ZumrSKLRSpEopK5VjcTqxIuq0UqSurFSESln0TfEE2mb9F0GEREBerjRSryMuKjcVksUpFX67e8wt4Db7Bz5064uamxCFAJXaQiO8U+sfpmHaCnRhJc2gVFaCEkWjvPi1TIZWk4P2tAEURGREREVPqlJSUg8p+LKotU5GrvReBkAS1L/onEr2Kgn6LI8yIVEVFJ6DPkLgDg9NlYjBo1Cj179sy68IXNwM0DuccQ+Qzm+hIcOndWZV2AkqpYL17Ru3dvJCUlaTqMYk1HRwffffddni7GBw8eYOrUqZDJZEUQGREREVHZ0LlzZwwbNizX
cmlpaRg5ciSio6NzLVvYDAwMMHPmzDxNWbx48SLmzZuHjLTB0dERCxcuhK6urkq5tLQ0jBgxAjExOb9MOIOZmRnmzJlTatYFKNaJFRERERERUUmgpekAiIiIiIiISjomVkRERERERGpiYkVERERERKSmYp1YyWQynDlzhostFDD2a+Fh3xYO9mvhYd8WHvZt4WC/Fh72beFgv5YdxTqxksvl+PvvvyGXyzUdSqnCfi087NvCwX4tPOzbwsO+LRzs18LDvi0c7Neyo1gnVkRERERERCUBEysiIiIiIiI1MbEiIiIiIiJSU7FOrMRiMZo1awaxWKzpUEoV9mvhYd8WDvZr4WHfFh72beFgvxYe9m3hYL+WHSJBEARNB0FERERERFSSFesRKyIiIiIiopKAiRUREREREZGamFgRERERERGpiYkVERERERGRmrQ1HUBWgoODcfHiRbx69QoJCQno3bs3KlasqOmwir3c+s3f3x937txRqePm5ob+/fsrv589exZPnz7F69evIRaL8f333xdZ/MXVtWvXcP36dbx9+xYAYG1tjaZNm8LDwwMAIJPJcOzYMdy/fx8ymQzu7u5o164dDA0NlW0cOXIEISEhiIiIgKWlJUaNGqWJUynWzp8/j5MnT6JevXpo06YNAGD9+vUIDg5WKVerVi106NBB+Z19m7V3794hICAAgYGBSEtLg7m5OTp37gx7e3sAgCAIOHPmDG7evInk5GQ4OTmhffv2sLCwULbB3weZLVq0CHFxcZm2165dG+3bt+c1q4aUlBScPn0ajx49QmJiImxtbdGmTRs4ODgA4DWbV7ndC+SlH7O6zlu2bInGjRsDSP//3sGDBxEeHo7IyEh4enqiT58+RXOCGpJTv8rlcpw6dQqBgYGIjY2FVCqFq6srfH19YWRkpGwjKSkJR44cwePHjyESieDl5YW2bdtCR0cHQNns19KmWCZWqampsLGxQfXq1bFz505Nh1Ni5KXf3N3d0blzZ+X3j5f+lMvlqFSpEhwdHXHr1q1CjbekMDY2hq+vL8zNzQEAt2/fxvbt2/HFF1/A2toaR48exdOnT9GzZ09IpVIcOXIEO3fuxNChQ1XaqV69OsLCwvDmzRtNnEaxFhYWhhs3bsDGxibTvpo1a6J58+bK7xKJJFMZ9q2qpKQkrF27FuXLl8fnn38OfX19xMTEQFdXV1nmwoULuHLlCrp06QIzMzOcPn0amzdvxtixY6Gtnf6/Bv4+yGzEiBH4cDHdiIgIbNq0CZUrV1Zu4zX7aQ4cOICIiAh07doVRkZGuHv3LjZt2oQxY8bA2NiY12we5XYvkJd+BAAfHx/UqlVL+T3j5h8AFAoFtLW1UbduXTx8+LBwT6iYyKlf09LS8Pr1azRt2hQ2NjZITk7G0aNHsW3bNowcOVJZbs+ePYiPj8eAAQOgUCiwb98+HDhwAN27dwdQNvu1tCmWUwE9PDzQokULeHl5aTqUEiUv/SYWi2FoaKj86Onpqexv3rw5GjRokOUNbllVoUIFeHh4wMLCAhYWFmjZsiV0dHQQGhqK5ORk3Lp1C61bt0b58uVhb2+Pzp07IyQkBKGhoco22rZti7p168LMzEyDZ1I8paamYs+ePejYsaPKjX8GiUSics1KpVKV/ezbzC5cuAATExN07twZDg4OMDMzg5ubm/KPA4Ig4MqVK2jatCkqVqwIGxsbdOnSBfHx8Xj06JGyHf4+yMzAwEDlenzy5AnMzMzg7OysLMNrNv/S0tLw4MED+Pr6wtnZGebm5vDx8YG5uTmuX7/OazYfcroXyGs/AoBUKlW5jj9MrHR0dNChQwfUqlVLZXZGaZZTv+rq6mLAgAGoXLkyLC0t4ejoiLZt2yI8PFw58hcZGYnAwEB06tQJjo6OKFeuHNq2bYt79+4hPj4eQNns19KmWI5YUeEJCgrC/PnzoaenBxcXF7Ro0QL6+vqaDqvEUCgUePDgAdLS0uDk5ITw8HAoFAq4uroq
y1haWsLExAQhISFwdHTUYLQlw+HDh+Hh4QFXV1ecPXs20/5//vkHd+/ehaGhITw9PdGsWbMsRwDoP48fP4abmxt27dqFoKAgGBsbo3bt2sq/Pr99+xYJCQkq162uri4cHR0REhICb29vTYVeosjlcty9excNGjSASCRSbuc1m38KhQKCIKiMmACAtrY2Xr58yWu2gOSnH8+fP4+zZ8/CxMQE3t7eaNCgAbS0iuXf44ullJQUAFD+wTA0NBS6urrK6dgA4OrqCpFIhNDQUA4mlBJMrMoQd3d3eHl5wdTUFLGxsTh58iS2bNmCYcOG8ZdlLt68eYM1a9ZAJpNBR0cHvXv3hpWVlXIe/8cjLQYGBkhISNBQtCXHvXv3EB4ejhEjRmS5v0qVKjAxMYGRkRHevHmDgIAAREdHo3fv3kUcackSGxuL69evo0GDBmjcuDFevXqFo0ePQiwWo3r16spr08DAQKWegYEBEhMTNRFyifTo0SMkJyejevXqym28Zj+NVCqFo6Mjzp49CysrKxgYGODevXsIDQ2Fubk5r9kCktd+rFevHuzs7KCnp4eQkBCcPHkSCQkJaN26dZHGW1LJZDIEBASgSpUqyhHrhISETP2upaUFPT093i+UIkysypAP/xJlY2MDGxsbLF68GEFBQSp/vaLMMh4wT05OxoMHD+Dv74/BgwdrOqwSLS4uDkePHsWAAQMy/ZU6w4fz+21sbGBkZISNGzciJiZGOa2NMhMEAfb29mjZsiUAwM7ODhEREbhx44ZKEkDquXXrFjw8PFQeTuc1++m6du2K/fv3Y+HChRCJRLCzs4O3tzfCw8M1HVqZ06BBA+XPNjY2EIvFOHjwIFq2bJnt72tKJ5fLsWvXLgiCgPbt22s6HCpi/K+jDDMzM1M+1M7EKmdisVh5U2Rvb49Xr17h8uXL8Pb2hlwuR3JyssqoVWJiIudH5yI8PByJiYlYtWqVcpsgCAgODsbVq1cxderUTCOpGauD8SY1Z0ZGRrCyslLZZmlpqXwYOuPaTExMVEkKEhMTy/SzKfnx9u1bPH/+HL169cqxHK/ZvDM3N8fgwYORmpqKlJQUGBkZwc/PD2ZmZrxmC8in9qODgwMUCgXevn0LS0vLQo+zpJLL5fDz80NcXBwGDhyo8nyloaFhptFVhUKBpKQk3i+UIkysyrB3797h/fv3Kr9cKW8EQYBcLoednR20tLTw/PlzVKpUCQAQFRWFuLg4ODk5aTjK4q18+fIYPXq0yrZ9+/bB0tISjRo1ynJ66uvXrwGA12wunJycEB0drbItOjoaJiYmAABTU1MYGhri+fPnsLW1BZD+PEBoaChq165d5PGWRLdv34aBgQE8PT1zLMdrNv90dHSgo6ODpKQkBAYGolWrVrxmC8in9uPr168hEokyTWWj/2QkVdHR0Rg0aFCm59cdHR2RnJyMV69eKZ+zevHiBQRB4PPYpUixTKxSU1MRExOj/B4bG4vXr19DT09PeWNAmeXUb3p6ejhz5gwqVaoEQ0NDxMTEICAgAObm5nBzc1PWiYuLQ1JSEuLi4iAIgvKmwNzcXGVFoLIkICAAHh4eMDExQUpKCv755x8EBQWhf//+0NXVRY0aNXD8+HHo6ekpl1t3dHRU+UUZExOD1NRUJCQkQCaTKfvVysoq05L3ZYVUKoW1tbXKNolEAj09PVhbWyMmJgb//PMPPDw8oK+vjzdv3uDYsWNwdnZW+csq+zaz+vXrY+3atTh37hwqV66MsLAw3Lx5U/kuJZFIhHr16uHcuXOwsLCAqakpTp8+DSMjI5X33fD3QdYEQcDt27dRrVo1lT8A8JpVT2BgIADAwsICMTExOHHiBCwtLVG9enVes/mQ2z1Ubv0YEhKCsLAwuLi4QCqVIiQkBMeOHUPVqlVVVhKOjIyEXC5HUlISUlNTlX2dkbCVNjn1q6GhIXbt2oXw8HD07dsXgiAon5vS09ODWCyGlZUV3N3dceDAAXTo0AFyuRyH
Dx+Gt7e3yh9eylq/ljYi4cMXchQTQUFB2LBhQ6bt1apVQ5cuXYo+oBIip35r3749duzYgfDwcCQnJ8PIyAhubm5o3ry5yhB0Vi8RBoBBgwbBxcWlMMMvtvbt24cXL14gISEBUqkUNjY2aNSokTIhzXhB8L179yCXy+Hm5ob27dur9GtWLw0FgK+++gqmpqZFdSrF3vr165UvBY2Li8PevXsRERGB1NRUmJiYoGLFimjatKnK9Ar2bdaePHmCkydPIjo6GmZmZqhfv77K8z8ZLwm9ceMGkpOTUa5cuUwvCeXvg6w9e/YMmzdvxrhx41T6i9eseu7fv4+TJ0/i3bt30NPTg5eXF1q0aKGcZs1rNm9yu4fKrR/Dw8Nx6NAhREVFQS6Xw9TUFFWrVkWDBg1Unq/K7mXZ06dPL7yT06Cc+tXHxwd//vlnlvU+vPaSkpJw+PBhPHnyJMsXBANlr19Lm2KZWBEREREREZUkXGObiIiIiIhITUysiIiIiIiI1MTEioiIiIiISE1MrIiIiIiIiNTExIqIiIiIiEhNTKyIiIiIiIjUxMSKiIiIiIhITUysiIiIiIiI1MTEioiIiIiISE1MrIiIiIiIiNTExIqIiIiIiEhNTKyIiIiIiIjU9H8ls5aDi0j55QAAAABJRU5ErkJggg==",
      "text/plain": [
       "<Figure size 1000x300 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "_ = targets.plot(ax_width=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also view the feature parsing specifications as a dict or a YAML string (here, as a YAML string).\n",
    "Note that all the defaults that were not specified in the YAML file above have now been filled in:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:31.346013Z",
     "iopub.status.busy": "2024-05-23T18:11:31.345645Z",
     "iopub.status.idle": "2024-05-23T18:11:31.356143Z",
     "shell.execute_reply": "2024-05-23T18:11:31.355441Z",
     "shell.execute_reply.started": "2024-05-23T18:11:31.345984Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RecA_PacBio_amplicon:\n",
      "  query_clip5: 4\n",
      "  query_clip3: 4\n",
      "  termini5:\n",
      "    filter:\n",
      "      clip5: 4\n",
      "      mutation_nt_count: 1\n",
      "      mutation_op_count: null\n",
      "      clip3: 0\n",
      "    return: []\n",
      "  gene:\n",
      "    filter:\n",
      "      mutation_nt_count: 30\n",
      "      mutation_op_count: null\n",
      "      clip5: 0\n",
      "      clip3: 0\n",
      "    return:\n",
      "    - mutations\n",
      "    - accuracy\n",
      "  spacer:\n",
      "    filter:\n",
      "      mutation_nt_count: 1\n",
      "      mutation_op_count: null\n",
      "      clip5: 0\n",
      "      clip3: 0\n",
      "    return: []\n",
      "  barcode:\n",
      "    return:\n",
      "    - sequence\n",
      "    - accuracy\n",
      "    filter:\n",
      "      clip5: 0\n",
      "      clip3: 0\n",
      "      mutation_nt_count: 0\n",
      "      mutation_op_count: 0\n",
      "  termini3:\n",
      "    filter:\n",
      "      clip3: 4\n",
      "      mutation_nt_count: 1\n",
      "      mutation_op_count: null\n",
      "      clip5: 0\n",
      "    return: []\n",
      "  variant_tag5:\n",
      "    return:\n",
      "    - sequence\n",
      "    filter:\n",
      "      clip5: 0\n",
      "      clip3: 0\n",
      "      mutation_nt_count: 0\n",
      "      mutation_op_count: 0\n",
      "  variant_tag3:\n",
      "    return:\n",
      "    - sequence\n",
      "    filter:\n",
      "      clip5: 0\n",
      "      clip3: 0\n",
      "      mutation_nt_count: 0\n",
      "      mutation_op_count: 0\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(targets.feature_parse_specs(\"yaml\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PacBio CCSs\n",
    "We will align and parse PacBio circular consensus sequences (CCSs).\n",
    "FASTQ files with these CCSs, along with the associated report files, were generated from the PacBio subreads `*.bam` file using version 4.0.0 of the PacBio `ccs` program (see [here](https://github.com/PacificBiosciences/ccs) for details on `ccs`) with a command like the following:\n",
    "\n",
    "    ccs \\\n",
    "        --min-length 50 \\\n",
    "        --max-length 5000 \\\n",
    "        --min-passes 3 \\\n",
    "        --min-rq 0.999 \\\n",
    "        --report-file recA_lib-1_report.txt \\\n",
    "        --num-threads 16 \\\n",
    "        recA_lib-1_subreads.bam \\\n",
    "        recA_lib-1_ccs.fastq\n",
    "   \n",
    "Note that to make this example fast, we have extracted just a few hundred CCSs from the $>10^5$ typically produced in a single PacBio run.\n",
    "\n",
    "Here is a data frame with the names of the FASTQ files and reports generated by the PacBio `ccs` program:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:31.357522Z",
     "iopub.status.busy": "2024-05-23T18:11:31.357148Z",
     "iopub.status.idle": "2024-05-23T18:11:31.372207Z",
     "shell.execute_reply": "2024-05-23T18:11:31.371567Z",
     "shell.execute_reply.started": "2024-05-23T18:11:31.357494Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>library</th>\n",
       "      <th>report</th>\n",
       "      <th>fastq</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>lib-1</td>\n",
       "      <td>input_files/recA_lib-1_report.txt</td>\n",
       "      <td>input_files/recA_lib-1_ccs.fastq</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>recA_lib-2</td>\n",
       "      <td>lib-2</td>\n",
       "      <td>input_files/recA_lib-2_report.txt</td>\n",
       "      <td>input_files/recA_lib-2_ccs.fastq</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         name library                             report  \\\n",
       "0  recA_lib-1   lib-1  input_files/recA_lib-1_report.txt   \n",
       "1  recA_lib-2   lib-2  input_files/recA_lib-2_report.txt   \n",
       "\n",
       "                              fastq  \n",
       "0  input_files/recA_lib-1_ccs.fastq  \n",
       "1  input_files/recA_lib-2_ccs.fastq  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_names = [\"recA_lib-1\", \"recA_lib-2\"]\n",
    "libraries = [\"lib-1\", \"lib-2\"]\n",
    "ccs_dir = \"input_files\"\n",
    "\n",
    "pacbio_runs = pd.DataFrame(\n",
    "    {\n",
    "        \"name\": run_names,\n",
    "        \"library\": libraries,\n",
    "        \"report\": [f\"{ccs_dir}/{name}_report.txt\" for name in run_names],\n",
    "        \"fastq\": [f\"{ccs_dir}/{name}_ccs.fastq\" for name in run_names],\n",
    "    }\n",
    ")\n",
    "\n",
    "pacbio_runs"
   ]
  },
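  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you need to produce these FASTQ and report files yourself, the `ccs` command shown earlier could be assembled for each run with a small helper. The following is only a sketch under stated assumptions: the function name `ccs_command` is hypothetical, the `<name>_subreads.bam` input naming is assumed from the example command, and the flags are exactly those listed in that command. The helper only builds the argument list and does not run `ccs`:\n",
    "\n",
    "```python\n",
    "# Hypothetical helper: builds, but does not run, the ccs argument list.\n",
    "def ccs_command(name, threads=16):\n",
    "    return [\n",
    "        'ccs',\n",
    "        '--min-length', '50',\n",
    "        '--max-length', '5000',\n",
    "        '--min-passes', '3',\n",
    "        '--min-rq', '0.999',\n",
    "        '--report-file', f'{name}_report.txt',\n",
    "        '--num-threads', str(threads),\n",
    "        f'{name}_subreads.bam',  # assumed input naming\n",
    "        f'{name}_ccs.fastq',\n",
    "    ]\n",
    "\n",
    "\n",
    "# Print one command line per run name used in this example.\n",
    "for run in ['recA_lib-1', 'recA_lib-2']:\n",
    "    print(' '.join(ccs_command(run)))\n",
    "```\n",
    "\n",
    "Keeping the arguments as a list rather than one string means they could be passed to `subprocess.run` without shell quoting, if you choose to execute them."
   ]
  },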
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create an [alignparse.ccs.Summaries](https://jbloomlab.github.io/alignparse/alignparse.ccs.html#alignparse.ccs.Summaries) object for these CCSs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:31.373886Z",
     "iopub.status.busy": "2024-05-23T18:11:31.373526Z",
     "iopub.status.idle": "2024-05-23T18:11:31.379847Z",
     "shell.execute_reply": "2024-05-23T18:11:31.379051Z",
     "shell.execute_reply.started": "2024-05-23T18:11:31.373860Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    recA_lib-1\n",
       "1    recA_lib-2\n",
       "Name: name, dtype: object"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pacbio_runs[\"name\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:31.381209Z",
     "iopub.status.busy": "2024-05-23T18:11:31.380881Z",
     "iopub.status.idle": "2024-05-23T18:11:31.386391Z",
     "shell.execute_reply": "2024-05-23T18:11:31.385577Z",
     "shell.execute_reply.started": "2024-05-23T18:11:31.381181Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('recA_lib-1', 'recA_lib-2')"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tuple(pacbio_runs[\"name\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:31.387769Z",
     "iopub.status.busy": "2024-05-23T18:11:31.387418Z",
     "iopub.status.idle": "2024-05-23T18:11:32.165947Z",
     "shell.execute_reply": "2024-05-23T18:11:32.164411Z",
     "shell.execute_reply.started": "2024-05-23T18:11:31.387737Z"
    }
   },
   "outputs": [],
   "source": [
    "ccs_summaries = alignparse.ccs.Summaries(pacbio_runs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(Note that if you do not have the `ccs` reports, you can still perform the steps above, but you need to set `report_col=None` when creating the [Summaries](https://jbloomlab.github.io/alignparse/alignparse.ccs.html#alignparse.ccs.Summaries) object, and you then cannot analyze the ZMW stats as done below.)\n",
    "\n",
    "Plot how many ZMWs yielded CCSs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:32.168569Z",
     "iopub.status.busy": "2024-05-23T18:11:32.168007Z",
     "iopub.status.idle": "2024-05-23T18:11:32.783337Z",
     "shell.execute_reply": "2024-05-23T18:11:32.782402Z",
     "shell.execute_reply.started": "2024-05-23T18:11:32.168515Z"
    }
   },
   "outputs": [],
   "source": [
    "# NBVAL_IGNORE_OUTPUT\n",
    "p = ccs_summaries.plot_zmw_stats()\n",
    "p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines\n",
    "_ = p.draw()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also get the ZMW stats as numerical values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:32.784919Z",
     "iopub.status.busy": "2024-05-23T18:11:32.784530Z",
     "iopub.status.idle": "2024-05-23T18:11:32.831505Z",
     "shell.execute_reply": "2024-05-23T18:11:32.830894Z",
     "shell.execute_reply.started": "2024-05-23T18:11:32.784893Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>status</th>\n",
       "      <th>number</th>\n",
       "      <th>fraction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>Success -- CCS generated</td>\n",
       "      <td>139</td>\n",
       "      <td>0.837349</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>Failed -- Lacking full passes</td>\n",
       "      <td>19</td>\n",
       "      <td>0.114458</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>Failed -- Draft generation error</td>\n",
       "      <td>3</td>\n",
       "      <td>0.018072</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>Failed -- CCS below minimum RQ</td>\n",
       "      <td>2</td>\n",
       "      <td>0.012048</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>Failed -- Min coverage violation</td>\n",
       "      <td>1</td>\n",
       "      <td>0.006024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>Failed -- Other reason</td>\n",
       "      <td>2</td>\n",
       "      <td>0.012048</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>recA_lib-2</td>\n",
       "      <td>Success -- CCS generated</td>\n",
       "      <td>124</td>\n",
       "      <td>0.794872</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>recA_lib-2</td>\n",
       "      <td>Failed -- Lacking full passes</td>\n",
       "      <td>22</td>\n",
       "      <td>0.141026</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>recA_lib-2</td>\n",
       "      <td>Failed -- Draft generation error</td>\n",
       "      <td>4</td>\n",
       "      <td>0.025641</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>recA_lib-2</td>\n",
       "      <td>Failed -- CCS below minimum RQ</td>\n",
       "      <td>2</td>\n",
       "      <td>0.012821</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>recA_lib-2</td>\n",
       "      <td>Failed -- Min coverage violation</td>\n",
       "      <td>2</td>\n",
       "      <td>0.012821</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>recA_lib-2</td>\n",
       "      <td>Failed -- Other reason</td>\n",
       "      <td>2</td>\n",
       "      <td>0.012821</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          name                            status  number  fraction\n",
       "0   recA_lib-1          Success -- CCS generated     139  0.837349\n",
       "1   recA_lib-1     Failed -- Lacking full passes      19  0.114458\n",
       "2   recA_lib-1  Failed -- Draft generation error       3  0.018072\n",
       "3   recA_lib-1    Failed -- CCS below minimum RQ       2  0.012048\n",
       "4   recA_lib-1  Failed -- Min coverage violation       1  0.006024\n",
       "5   recA_lib-1            Failed -- Other reason       2  0.012048\n",
       "6   recA_lib-2          Success -- CCS generated     124  0.794872\n",
       "7   recA_lib-2     Failed -- Lacking full passes      22  0.141026\n",
       "8   recA_lib-2  Failed -- Draft generation error       4  0.025641\n",
       "9   recA_lib-2    Failed -- CCS below minimum RQ       2  0.012821\n",
       "10  recA_lib-2  Failed -- Min coverage violation       2  0.012821\n",
       "11  recA_lib-2            Failed -- Other reason       2  0.012821"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ccs_summaries.zmw_stats()"
   ]
  },
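  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the relationship between the `number` and `fraction` columns concrete, the sketch below recomputes the fractions for `recA_lib-1` from the counts in the table above using plain Python. It assumes that each fraction is the count for a status divided by the total number of ZMWs for that run, which matches the values in the table; the variable names are ours and are not part of alignparse:\n",
    "\n",
    "```python\n",
    "# Counts copied from the recA_lib-1 rows of the table above.\n",
    "lib1_counts = {\n",
    "    'Success -- CCS generated': 139,\n",
    "    'Failed -- Lacking full passes': 19,\n",
    "    'Failed -- Draft generation error': 3,\n",
    "    'Failed -- CCS below minimum RQ': 2,\n",
    "    'Failed -- Min coverage violation': 1,\n",
    "    'Failed -- Other reason': 2,\n",
    "}\n",
    "\n",
    "# Assumption: fraction = count for a status / total ZMWs in the run.\n",
    "total_zmws = sum(lib1_counts.values())\n",
    "lib1_fractions = {status: n / total_zmws for status, n in lib1_counts.items()}\n",
    "\n",
    "for status, frac in lib1_fractions.items():\n",
    "    print(f'{status}: {frac:.6f}')\n",
    "```\n",
    "\n",
    "With the full data frame, the same calculation would typically be a pandas group-by on `name`."
   ]
  },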
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Statistics on the CCSs (length, number of subread passes, quality):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:32.833209Z",
     "iopub.status.busy": "2024-05-23T18:11:32.832843Z",
     "iopub.status.idle": "2024-05-23T18:11:33.710580Z",
     "shell.execute_reply": "2024-05-23T18:11:33.708786Z",
     "shell.execute_reply.started": "2024-05-23T18:11:32.833184Z"
    }
   },
   "outputs": [],
   "source": [
    "# NBVAL_IGNORE_OUTPUT\n",
    "for stat in [\"length\", \"passes\", \"accuracy\"]:\n",
    "    if ccs_summaries.has_stat(stat):\n",
    "        p = ccs_summaries.plot_ccs_stats(stat)\n",
    "        p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines\n",
    "        _ = p.draw()\n",
    "    else:\n",
    "        print(f\"No {stat} statistics available.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also get these statistics numerically; for instance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:33.713769Z",
     "iopub.status.busy": "2024-05-23T18:11:33.713146Z",
     "iopub.status.idle": "2024-05-23T18:11:33.726873Z",
     "shell.execute_reply": "2024-05-23T18:11:33.726189Z",
     "shell.execute_reply.started": "2024-05-23T18:11:33.713708Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>1325</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>1340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>1339</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>985</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>1196</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         name  length\n",
       "0  recA_lib-1    1325\n",
       "1  recA_lib-1    1340\n",
       "2  recA_lib-1    1339\n",
       "3  recA_lib-1     985\n",
       "4  recA_lib-1    1196"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ccs_summaries.ccs_stats(\"length\").head(n=5)"
   ]
  },
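  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the statistics are available as numbers, they can be summarized with standard tools. As a minimal sketch, the snippet below takes the five lengths printed above and summarizes them with Python's built-in `statistics` module. The list is copied by hand for illustration only; in practice you would more likely call pandas methods on the full data frame:\n",
    "\n",
    "```python\n",
    "import statistics\n",
    "\n",
    "# The five CCS lengths shown in the table above (illustration only).\n",
    "lengths = [1325, 1340, 1339, 985, 1196]\n",
    "\n",
    "median_length = statistics.median(lengths)\n",
    "shortest = min(lengths)\n",
    "longest = max(lengths)\n",
    "\n",
    "print(f'median={median_length}, shortest={shortest}, longest={longest}')\n",
    "```"
   ]
  },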
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Align and parse CCSs\n",
    "Now we align the CCSs using our [Targets](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets) object, and parse the features from queries that align sufficiently well.\n",
    "\n",
    "First, we create an [alignparse.minimap2.Mapper](https://jbloomlab.github.io/alignparse/alignparse.minimap2.html#alignparse.minimap2.Mapper) object to run [minimap2](https://github.com/lh3/minimap2), which is used for the alignments.\n",
    "We use [minimap2](https://github.com/lh3/minimap2) options that are tailored for codon-level deep mutational scanning experiments like this one (these are specified by [alignparse.minimap2.OPTIONS_CODON_DMS](https://jbloomlab.github.io/alignparse/alignparse.minimap2.html#alignparse.minimap2.OPTIONS_CODON_DMS)):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:33.728152Z",
     "iopub.status.busy": "2024-05-23T18:11:33.727822Z",
     "iopub.status.idle": "2024-05-23T18:11:33.741144Z",
     "shell.execute_reply": "2024-05-23T18:11:33.739902Z",
     "shell.execute_reply.started": "2024-05-23T18:11:33.728127Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using `minimap2` 2.22-r1101 with these options:\n",
      "-A2 -B4 -O12 -E2 --end-bonus=13 --secondary=no --cs\n"
     ]
    }
   ],
   "source": [
    "# NBVAL_IGNORE_OUTPUT\n",
    "\n",
    "mapper = alignparse.minimap2.Mapper(alignparse.minimap2.OPTIONS_CODON_DMS)\n",
    "\n",
    "print(\n",
    "    f\"Using `minimap2` {mapper.version} with these options:\\n\"\n",
    "    + \" \".join(mapper.options)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now use [Targets.align_and_parse](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets.align_and_parse) to align and parse our FASTQ files of CCSs.\n",
    "(Note that if needed, you can also perform these steps separately for each FASTQ file by running [Targets.align](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets.align) and [Targets.parse_alignment](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets.parse_alignment). An example of this is in the [Lassa virus glycoprotein](https://jbloomlab.github.io/alignparse/lasv_pilot.html) example notebook.)\n",
    "\n",
    "First, we define a directory to place the created files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:33.743528Z",
     "iopub.status.busy": "2024-05-23T18:11:33.742927Z",
     "iopub.status.idle": "2024-05-23T18:11:33.748800Z",
     "shell.execute_reply": "2024-05-23T18:11:33.747687Z",
     "shell.execute_reply.started": "2024-05-23T18:11:33.743471Z"
    }
   },
   "outputs": [],
   "source": [
    "align_and_parse_outdir = os.path.join(outdir, \"RecA_align_and_parse\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we run [Targets.align_and_parse](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets.align_and_parse) on all the CCS sets in the data frame `pacbio_runs`, which we set up above to specify information on each PacBio run.\n",
    "Here `name_col` names the column holding each run's name, `queryfile_col` names the column holding the FASTQ file for that run, and `group_cols` lists the columns used to group runs when aggregating results (here we aggregate by library):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:33.758115Z",
     "iopub.status.busy": "2024-05-23T18:11:33.757696Z",
     "iopub.status.idle": "2024-05-23T18:11:34.912577Z",
     "shell.execute_reply": "2024-05-23T18:11:34.911220Z",
     "shell.execute_reply.started": "2024-05-23T18:11:33.758084Z"
    }
   },
   "outputs": [],
   "source": [
    "readstats, aligned, filtered = targets.align_and_parse(\n",
    "    df=pacbio_runs,\n",
    "    mapper=mapper,\n",
    "    outdir=align_and_parse_outdir,\n",
    "    name_col=\"name\",\n",
    "    queryfile_col=\"fastq\",\n",
    "    group_cols=[\"library\"],\n",
    "    overwrite=True,  # overwrite any existing output\n",
    "    ncpus=-1,  # use all available CPUs\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The return value from running the function above is a tuple of three elements: `readstats`, `aligned`, and `filtered`.\n",
    "We go through these one by one.\n",
    "\n",
    "The `readstats` variable is a data frame that gives summary statistics for each run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:34.914587Z",
     "iopub.status.busy": "2024-05-23T18:11:34.914153Z",
     "iopub.status.idle": "2024-05-23T18:11:34.928149Z",
     "shell.execute_reply": "2024-05-23T18:11:34.926907Z",
     "shell.execute_reply.started": "2024-05-23T18:11:34.914551Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>name</th>\n",
       "      <th>category</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>filtered RecA_PacBio_amplicon</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>aligned RecA_PacBio_amplicon</td>\n",
       "      <td>123</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>unmapped</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>lib-2</td>\n",
       "      <td>recA_lib-2</td>\n",
       "      <td>filtered RecA_PacBio_amplicon</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>lib-2</td>\n",
       "      <td>recA_lib-2</td>\n",
       "      <td>aligned RecA_PacBio_amplicon</td>\n",
       "      <td>112</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>lib-2</td>\n",
       "      <td>recA_lib-2</td>\n",
       "      <td>unmapped</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  library        name                       category  count\n",
       "0   lib-1  recA_lib-1  filtered RecA_PacBio_amplicon     16\n",
       "1   lib-1  recA_lib-1   aligned RecA_PacBio_amplicon    123\n",
       "2   lib-1  recA_lib-1                       unmapped      0\n",
       "3   lib-2  recA_lib-2  filtered RecA_PacBio_amplicon     12\n",
       "4   lib-2  recA_lib-2   aligned RecA_PacBio_amplicon    112\n",
       "5   lib-2  recA_lib-2                       unmapped      0"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "readstats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above we see that all reads mapped to our single target (`RecA_PacBio_amplicon`): the `unmapped` count is 0 for both runs. Most reads could be fully aligned and parsed given our `feature_parse_specs` (123 for `recA_lib-1` and 112 for `recA_lib-2`), but some were filtered for not passing these specs (16 and 12, respectively).\n",
    "\n",
    "We can plot `readstats` for easy viewing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:34.930535Z",
     "iopub.status.busy": "2024-05-23T18:11:34.930006Z",
     "iopub.status.idle": "2024-05-23T18:11:35.264328Z",
     "shell.execute_reply": "2024-05-23T18:11:35.263529Z",
     "shell.execute_reply.started": "2024-05-23T18:11:34.930489Z"
    }
   },
   "outputs": [],
   "source": [
    "# NBVAL_IGNORE_OUTPUT\n",
    "p = (\n",
    "    ggplot(readstats, aes(\"category\", \"count\"))\n",
    "    + geom_bar(stat=\"identity\")\n",
    "    + facet_wrap(\"~ name\", nrow=1)\n",
    "    + theme(\n",
    "        axis_text_x=element_text(angle=90),\n",
    "        figure_size=(1.5 * len(pacbio_runs), 2.5),\n",
    "        panel_grid_major_x=element_blank(),  # no vertical grid lines\n",
    "    )\n",
    ")\n",
    "_ = p.draw()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `aligned` variable is a dict keyed by target name; each value is a data frame with information on all queries (CCSs) that were successfully aligned and parsed for that target:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:35.265625Z",
     "iopub.status.busy": "2024-05-23T18:11:35.265382Z",
     "iopub.status.idle": "2024-05-23T18:11:35.281505Z",
     "shell.execute_reply": "2024-05-23T18:11:35.280841Z",
     "shell.execute_reply.started": "2024-05-23T18:11:35.265601Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First few lines of parsed alignments for RecA_PacBio_amplicon\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>name</th>\n",
       "      <th>query_name</th>\n",
       "      <th>query_clip5</th>\n",
       "      <th>query_clip3</th>\n",
       "      <th>gene_mutations</th>\n",
       "      <th>gene_accuracy</th>\n",
       "      <th>barcode_sequence</th>\n",
       "      <th>barcode_accuracy</th>\n",
       "      <th>variant_tag5_sequence</th>\n",
       "      <th>variant_tag3_sequence</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/4391577/ccs</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>C100A T102A G658C C659T del840to840</td>\n",
       "      <td>0.999455</td>\n",
       "      <td>AAGATACACTCGAAATCT</td>\n",
       "      <td>1.0</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/4915465/ccs</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>C142G G144T T329A A738G A946T C947A</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>AAATATCATCGCGGCCAG</td>\n",
       "      <td>1.0</td>\n",
       "      <td>T</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/4981392/ccs</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>C142G G144T T329A A738G A946T C947A</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>AAATATCATCGCGGCCAG</td>\n",
       "      <td>1.0</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/6029553/ccs</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>T83A G84A A106T T107A G108A ins693G G862T C863...</td>\n",
       "      <td>0.999940</td>\n",
       "      <td>CTAATAGTAGTTTTCCAG</td>\n",
       "      <td>1.0</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/6488565/ccs</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>A254C G255T A466T T467G C468T C940G G942A</td>\n",
       "      <td>0.999967</td>\n",
       "      <td>TATTTATACCCATGAGTG</td>\n",
       "      <td>1.0</td>\n",
       "      <td>A</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  library        name                        query_name  query_clip5  \\\n",
       "0   lib-1  recA_lib-1  m54228_180801_171631/4391577/ccs            1   \n",
       "1   lib-1  recA_lib-1  m54228_180801_171631/4915465/ccs            0   \n",
       "2   lib-1  recA_lib-1  m54228_180801_171631/4981392/ccs            0   \n",
       "3   lib-1  recA_lib-1  m54228_180801_171631/6029553/ccs            0   \n",
       "4   lib-1  recA_lib-1  m54228_180801_171631/6488565/ccs            0   \n",
       "\n",
       "   query_clip3                                     gene_mutations  \\\n",
       "0            0                C100A T102A G658C C659T del840to840   \n",
       "1            0                C142G G144T T329A A738G A946T C947A   \n",
       "2            0                C142G G144T T329A A738G A946T C947A   \n",
       "3            0  T83A G84A A106T T107A G108A ins693G G862T C863...   \n",
       "4            0          A254C G255T A466T T467G C468T C940G G942A   \n",
       "\n",
       "   gene_accuracy    barcode_sequence  barcode_accuracy variant_tag5_sequence  \\\n",
       "0       0.999455  AAGATACACTCGAAATCT               1.0                     G   \n",
       "1       1.000000  AAATATCATCGCGGCCAG               1.0                     T   \n",
       "2       1.000000  AAATATCATCGCGGCCAG               1.0                     G   \n",
       "3       0.999940  CTAATAGTAGTTTTCCAG               1.0                     G   \n",
       "4       0.999967  TATTTATACCCATGAGTG               1.0                     A   \n",
       "\n",
       "  variant_tag3_sequence  \n",
       "0                     C  \n",
       "1                     T  \n",
       "2                     C  \n",
       "3                     C  \n",
       "4                     T  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "for target in targets.target_names:\n",
    "    print(f\"First few lines of parsed alignments for {target}\")\n",
    "    display(aligned[target].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we have just one target, we get the data frame in `aligned` for that single target into a new variable (`aligned_df`) and save it for analysis in the next subsection.\n",
    "We also extract just the columns of interest from the data frame, and rename `barcode_sequence` to the shorter name of `barcode`.\n",
    "Also, because real analyses typically read the barcode by Illumina sequencing in the reverse direction, we make this new `barcode` column the **reverse complement** of `barcode_sequence`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:35.283682Z",
     "iopub.status.busy": "2024-05-23T18:11:35.283290Z",
     "iopub.status.idle": "2024-05-23T18:11:35.301722Z",
     "shell.execute_reply": "2024-05-23T18:11:35.301041Z",
     "shell.execute_reply.started": "2024-05-23T18:11:35.283648Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>name</th>\n",
       "      <th>query_name</th>\n",
       "      <th>barcode</th>\n",
       "      <th>gene_mutations</th>\n",
       "      <th>barcode_accuracy</th>\n",
       "      <th>gene_accuracy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/4391577/ccs</td>\n",
       "      <td>AGATTTCGAGTGTATCTT</td>\n",
       "      <td>C100A T102A G658C C659T del840to840</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.999455</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/4915465/ccs</td>\n",
       "      <td>CTGGCCGCGATGATATTT</td>\n",
       "      <td>C142G G144T T329A A738G A946T C947A</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/4981392/ccs</td>\n",
       "      <td>CTGGCCGCGATGATATTT</td>\n",
       "      <td>C142G G144T T329A A738G A946T C947A</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/6029553/ccs</td>\n",
       "      <td>CTGGAAAACTACTATTAG</td>\n",
       "      <td>T83A G84A A106T T107A G108A ins693G G862T C863...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.999940</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/6488565/ccs</td>\n",
       "      <td>CACTCATGGGTATAAATA</td>\n",
       "      <td>A254C G255T A466T T467G C468T C940G G942A</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.999967</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  library        name                        query_name             barcode  \\\n",
       "0   lib-1  recA_lib-1  m54228_180801_171631/4391577/ccs  AGATTTCGAGTGTATCTT   \n",
       "1   lib-1  recA_lib-1  m54228_180801_171631/4915465/ccs  CTGGCCGCGATGATATTT   \n",
       "2   lib-1  recA_lib-1  m54228_180801_171631/4981392/ccs  CTGGCCGCGATGATATTT   \n",
       "3   lib-1  recA_lib-1  m54228_180801_171631/6029553/ccs  CTGGAAAACTACTATTAG   \n",
       "4   lib-1  recA_lib-1  m54228_180801_171631/6488565/ccs  CACTCATGGGTATAAATA   \n",
       "\n",
       "                                      gene_mutations  barcode_accuracy  \\\n",
       "0                C100A T102A G658C C659T del840to840               1.0   \n",
       "1                C142G G144T T329A A738G A946T C947A               1.0   \n",
       "2                C142G G144T T329A A738G A946T C947A               1.0   \n",
       "3  T83A G84A A106T T107A G108A ins693G G862T C863...               1.0   \n",
       "4          A254C G255T A466T T467G C468T C940G G942A               1.0   \n",
       "\n",
       "   gene_accuracy  \n",
       "0       0.999455  \n",
       "1       1.000000  \n",
       "2       1.000000  \n",
       "3       0.999940  \n",
       "4       0.999967  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "assert len(aligned) == 1, \"not just one target\"\n",
    "\n",
    "aligned_df = aligned[targets.target_names[0]].assign(\n",
    "    barcode=lambda x: x[\"barcode_sequence\"].map(dms_variants.utils.reverse_complement)\n",
    ")[\n",
    "    [\n",
    "        \"library\",\n",
    "        \"name\",\n",
    "        \"query_name\",\n",
    "        \"barcode\",\n",
    "        \"gene_mutations\",\n",
    "        \"barcode_accuracy\",\n",
    "        \"gene_accuracy\",\n",
    "    ]\n",
    "]\n",
    "\n",
    "aligned_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, the `filtered` variable gives the reason each query that aligned but failed `feature_parse_specs` was filtered.\n",
    "Like `aligned`, `filtered` is a dict keyed by target name with values being data frames giving the information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:35.303428Z",
     "iopub.status.busy": "2024-05-23T18:11:35.303081Z",
     "iopub.status.idle": "2024-05-23T18:11:35.314734Z",
     "shell.execute_reply": "2024-05-23T18:11:35.313849Z",
     "shell.execute_reply.started": "2024-05-23T18:11:35.303401Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First few lines of filtering information for RecA_PacBio_amplicon\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>name</th>\n",
       "      <th>query_name</th>\n",
       "      <th>filter_reason</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/4194459/ccs</td>\n",
       "      <td>spacer mutation_nt_count</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/4325806/ccs</td>\n",
       "      <td>barcode mutation_nt_count</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/4391313/ccs</td>\n",
       "      <td>termini3 mutation_nt_count</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/4391375/ccs</td>\n",
       "      <td>gene clip3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>recA_lib-1</td>\n",
       "      <td>m54228_180801_171631/4391467/ccs</td>\n",
       "      <td>gene clip3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  library        name                        query_name  \\\n",
       "0   lib-1  recA_lib-1  m54228_180801_171631/4194459/ccs   \n",
       "1   lib-1  recA_lib-1  m54228_180801_171631/4325806/ccs   \n",
       "2   lib-1  recA_lib-1  m54228_180801_171631/4391313/ccs   \n",
       "3   lib-1  recA_lib-1  m54228_180801_171631/4391375/ccs   \n",
       "4   lib-1  recA_lib-1  m54228_180801_171631/4391467/ccs   \n",
       "\n",
       "                filter_reason  \n",
       "0    spacer mutation_nt_count  \n",
       "1   barcode mutation_nt_count  \n",
       "2  termini3 mutation_nt_count  \n",
       "3                  gene clip3  \n",
       "4                  gene clip3  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "for target in targets.target_names:\n",
    "    print(f\"First few lines of filtering information for {target}\")\n",
    "    display(filtered[target].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen above, the `filter_reason` column gives which particular specification in `feature_parse_specs` was not met.\n",
    "\n",
    "Plot this information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:35.316220Z",
     "iopub.status.busy": "2024-05-23T18:11:35.315831Z",
     "iopub.status.idle": "2024-05-23T18:11:35.601375Z",
     "shell.execute_reply": "2024-05-23T18:11:35.600487Z",
     "shell.execute_reply.started": "2024-05-23T18:11:35.316191Z"
    }
   },
   "outputs": [],
   "source": [
    "# NBVAL_IGNORE_OUTPUT\n",
    "for targetname in targets.target_names:\n",
    "    target_filtered = filtered[targetname]\n",
    "    nreasons = target_filtered[\"filter_reason\"].nunique()\n",
    "    p = (\n",
    "        ggplot(target_filtered, aes(\"filter_reason\"))\n",
    "        + geom_bar()\n",
    "        + facet_wrap(\"~ name\", nrow=1)\n",
    "        + labs(title=targetname)\n",
    "        + theme(\n",
    "            axis_text_x=element_text(angle=90),\n",
    "            figure_size=(0.3 * nreasons * len(pacbio_runs), 2.5),\n",
    "            panel_grid_major_x=element_blank(),  # no vertical grid lines\n",
    "        )\n",
    "    )\n",
    "    _ = p.draw()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The example usage of [Targets.align_and_parse](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets.align_and_parse) above read all of the information on the parsed alignments into data frames.\n",
    "For large data sets, these data frames might be so large that you don't want to read them into memory.\n",
    "In that case, use the `to_csv` option, which makes [Targets.align_and_parse](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets.align_and_parse) simply give the locations of CSV files holding the data frames.\n",
    "Here is an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:35.602845Z",
     "iopub.status.busy": "2024-05-23T18:11:35.602604Z",
     "iopub.status.idle": "2024-05-23T18:11:36.764052Z",
     "shell.execute_reply": "2024-05-23T18:11:36.762178Z",
     "shell.execute_reply.started": "2024-05-23T18:11:35.602821Z"
    }
   },
   "outputs": [],
   "source": [
    "readstats_csv, aligned_csv, filtered_csv = targets.align_and_parse(\n",
    "    df=pacbio_runs,\n",
    "    mapper=mapper,\n",
    "    outdir=align_and_parse_outdir,\n",
    "    name_col=\"name\",\n",
    "    queryfile_col=\"fastq\",\n",
    "    group_cols=[\"library\"],\n",
    "    to_csv=True,\n",
    "    overwrite=True,  # overwrite any existing output\n",
    "    ncpus=-1,  # use all available CPUs\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the returned information on the parsed alignments and filtering just gives the locations of the CSV files for the data frames:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:36.766619Z",
     "iopub.status.busy": "2024-05-23T18:11:36.766143Z",
     "iopub.status.idle": "2024-05-23T18:11:36.772286Z",
     "shell.execute_reply": "2024-05-23T18:11:36.771445Z",
     "shell.execute_reply.started": "2024-05-23T18:11:36.766576Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'RecA_PacBio_amplicon': './output_files/RecA_align_and_parse/RecA_PacBio_amplicon_aligned.csv'}\n",
      "{'RecA_PacBio_amplicon': './output_files/RecA_align_and_parse/RecA_PacBio_amplicon_filtered.csv'}\n"
     ]
    }
   ],
   "source": [
    "print(aligned_csv)\n",
    "print(filtered_csv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that a read with no mutations is recorded as an empty string in these CSV files, so if you read them with [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), you need to pass `na_filter=False` (i.e., `pandas.read_csv(<csv_file>, na_filter=False)`) so that empty strings are not converted to `nan` values.\n",
    "\n",
    "Here are all of the files created by running [Targets.align_and_parse](https://jbloomlab.github.io/alignparse/alignparse.targets.html#alignparse.targets.Targets.align_and_parse) (they also include SAM alignments and parsing results for each individual run):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:36.773975Z",
     "iopub.status.busy": "2024-05-23T18:11:36.773582Z",
     "iopub.status.idle": "2024-05-23T18:11:36.783794Z",
     "shell.execute_reply": "2024-05-23T18:11:36.782976Z",
     "shell.execute_reply.started": "2024-05-23T18:11:36.773942Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Contents of ./output_files/RecA_align_and_parse:\n",
      "------------------------------------------------\n",
      "  RecA_PacBio_amplicon_aligned.csv\n",
      "  RecA_PacBio_amplicon_filtered.csv\n",
      "  lib-1_recA_lib-1/RecA_PacBio_amplicon_aligned.csv\n",
      "  lib-1_recA_lib-1/RecA_PacBio_amplicon_filtered.csv\n",
      "  lib-1_recA_lib-1/alignments.sam\n",
      "  lib-2_recA_lib-2/RecA_PacBio_amplicon_aligned.csv\n",
      "  lib-2_recA_lib-2/RecA_PacBio_amplicon_filtered.csv\n",
      "  lib-2_recA_lib-2/alignments.sam\n"
     ]
    }
   ],
   "source": [
    "print(f\"Contents of {align_and_parse_outdir}:\\n\" + \"-\" * 48)\n",
    "for d, _, fs in sorted(os.walk(align_and_parse_outdir)):\n",
    "    for f in sorted(fs):\n",
    "        print(\"  \" + os.path.relpath(os.path.join(d, f), align_and_parse_outdir))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Per-barcode consensus sequences\n",
    "In a deep mutational scanning experiment, we typically want to determine the sequence of the gene variant associated with each barcode.\n",
    "In this section, we do that, and we also estimate the empirical accuracy of the sequencing by determining how often two sequences with the same barcode are identical.\n",
    "\n",
    "We start with the `aligned_df` data frame generated in the previous subsection.\n",
    "First, we want to plot the distribution of accuracies for the barcodes and genes.\n",
    "Because these span a wide range, it's most convenient to convert the accuracies into error rates (1 - accuracy), and then plot these error rates on a log scale.\n",
    "In order to do this, we also need to set some floor for the error rates since zero can't be plotted on a log scale.\n",
    "Compute these \"floored\" error rates:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:36.785813Z",
     "iopub.status.busy": "2024-05-23T18:11:36.785440Z",
     "iopub.status.idle": "2024-05-23T18:11:36.795202Z",
     "shell.execute_reply": "2024-05-23T18:11:36.794178Z",
     "shell.execute_reply.started": "2024-05-23T18:11:36.785784Z"
    }
   },
   "outputs": [],
   "source": [
    "error_rate_floor = 1e-7  # error rates below this floor are set to the floor\n",
    "\n",
    "aligned_df = aligned_df.assign(\n",
    "    barcode_error=lambda x: numpy.clip(\n",
    "        1 - x[\"barcode_accuracy\"], error_rate_floor, None\n",
    "    ),\n",
    "    gene_error=lambda x: numpy.clip(1 - x[\"gene_accuracy\"], error_rate_floor, None),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We anticipate excluding all CCSs for which the error rate for either the gene or barcode is $>10^{-4}$.\n",
    "Specify this cutoff:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:36.796919Z",
     "iopub.status.busy": "2024-05-23T18:11:36.796481Z",
     "iopub.status.idle": "2024-05-23T18:11:36.801126Z",
     "shell.execute_reply": "2024-05-23T18:11:36.800184Z",
     "shell.execute_reply.started": "2024-05-23T18:11:36.796885Z"
    }
   },
   "outputs": [],
   "source": [
    "error_cutoff = 1e-4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now plot the distributiton of these error rates, drawing an orange vertical line at the cutoff:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:36.802650Z",
     "iopub.status.busy": "2024-05-23T18:11:36.802265Z",
     "iopub.status.idle": "2024-05-23T18:11:37.481052Z",
     "shell.execute_reply": "2024-05-23T18:11:37.480057Z",
     "shell.execute_reply.started": "2024-05-23T18:11:36.802623Z"
    }
   },
   "outputs": [],
   "source": [
    "# NBVAL_IGNORE_OUTPUT\n",
    "p = (\n",
    "    ggplot(\n",
    "        aligned_df.melt(\n",
    "            id_vars=[\"library\"],\n",
    "            value_vars=[\"barcode_error\", \"gene_error\"],\n",
    "            var_name=\"feature_type\",\n",
    "            value_name=\"error rate\",\n",
    "        ),\n",
    "        aes(\"error rate\"),\n",
    "    )\n",
    "    + geom_histogram(bins=25)\n",
    "    + geom_vline(xintercept=error_cutoff, linetype=\"dashed\", color=CBPALETTE[1])\n",
    "    + facet_grid(\"library ~ feature_type\")\n",
    "    + theme(figure_size=(4.5, 2 * len(libraries)))\n",
    "    + ylab(\"number of CCSs\")\n",
    "    + scale_x_log10()\n",
    ")\n",
    "\n",
    "_ = p.draw()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plot above shows that a modest number of CCSs fail the error-rate filters (are to the right of the cutoff).\n",
    "\n",
    "We mark to retain the CCSs that pass the filters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.482622Z",
     "iopub.status.busy": "2024-05-23T18:11:37.482385Z",
     "iopub.status.idle": "2024-05-23T18:11:37.490347Z",
     "shell.execute_reply": "2024-05-23T18:11:37.489475Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.482599Z"
    }
   },
   "outputs": [],
   "source": [
    "aligned_df = aligned_df.assign(\n",
    "    retained=lambda x: (\n",
    "        (x[\"gene_error\"] <= error_cutoff) & (x[\"barcode_error\"] <= error_cutoff)\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are the numbers retained:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.491699Z",
     "iopub.status.busy": "2024-05-23T18:11:37.491372Z",
     "iopub.status.idle": "2024-05-23T18:11:37.506206Z",
     "shell.execute_reply": "2024-05-23T18:11:37.505526Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.491672Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>retained</th>\n",
       "      <th>number of CCSs</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>False</td>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>True</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lib-2</td>\n",
       "      <td>False</td>\n",
       "      <td>39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>lib-2</td>\n",
       "      <td>True</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  library  retained  number of CCSs\n",
       "0   lib-1     False              33\n",
       "1   lib-1      True              90\n",
       "2   lib-2     False              39\n",
       "3   lib-2      True              73"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(\n",
    "    aligned_df.groupby([\"library\", \"retained\"])\n",
    "    .size()\n",
    "    .rename(\"number of CCSs\")\n",
    "    .reset_index()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before getting the consensus sequence for each barcode, we next want to know how many different CCSs we have per barcode among the retained CCSs.\n",
    "Plot this distribution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.507551Z",
     "iopub.status.busy": "2024-05-23T18:11:37.507294Z",
     "iopub.status.idle": "2024-05-23T18:11:37.715403Z",
     "shell.execute_reply": "2024-05-23T18:11:37.714647Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.507523Z"
    }
   },
   "outputs": [],
   "source": [
    "# NBVAL_IGNORE_OUTPUT\n",
    "max_CCSs = 8  # in plot, group barcodes with >= this many CCSs\n",
    "\n",
    "p = (\n",
    "    ggplot(\n",
    "        aligned_df.query(\"retained\")\n",
    "        .groupby([\"library\", \"barcode\"])\n",
    "        .size()\n",
    "        .rename(\"nseqs\")\n",
    "        .reset_index()\n",
    "        .assign(nseqs=lambda x: numpy.clip(x[\"nseqs\"], None, max_CCSs)),\n",
    "        aes(\"nseqs\"),\n",
    "    )\n",
    "    + geom_bar()\n",
    "    + facet_wrap(\"~ library\", nrow=1)\n",
    "    + theme(\n",
    "        figure_size=(1.75 * len(libraries), 2),\n",
    "        panel_grid_major_x=element_blank(),  # no vertial tick lines\n",
    "    )\n",
    "    + ylab(\"number of barcodes\")\n",
    "    + xlab(\"CCSs for barcode\")\n",
    ")\n",
    "\n",
    "_ = p.draw()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see above that barcodes are often sequenced just once, but are also often sequenced two or more times.\n",
    "\n",
    "From the barcodes with multiple CCSs, we can estimate the true (or \"empirical\") accuracy of the CCSs.\n",
    "This is different than the purported accuracy returned by the `ccs` program and plotted above.\n",
    "Those `ccs` accuracies are PacBio's estimation of accuracy from the sequencing, but they may not be fully correct.\n",
    "In addition, not all the \"error\" comes from pure sequencing errors: we can also have experimental factors such as barcode collisions (two different variants sharing the same barcode) or PCR strand exchange make molecules with the same barcode actually different.\n",
    "\n",
    "The concept of empirical accuracy is quite simple: we look to see how often CCSs with the same barcode actually have the same gene sequence.\n",
    "If the sequencing is accurate and their aren't additional experimental factors causing effective errors, then CCSs with the same barcode will always be identical.\n",
    "The less often they are identical, the lower the empirical accuracy.\n",
    "The actual calculation of empirical accuracy is implemented in [alignparse.consensus.empirical_accuracy](https://jbloomlab.github.io/alignparse/alignparse.consensus.html#alignparse.consensus.empirical_accuracy) (see the docs of that function for a full explanation of the calculation).\n",
    "\n",
    "In addition, we'd like to split out the analysis by whether or not the CCSs have an indel.\n",
    "The reason is that indels are the main source of error in PacBio sequencing, but we plan to disregard all sequences with indels anyway.\n",
    "So first use [alignparse.consensus.add_mut_info_cols](https://jbloomlab.github.io/alignparse/alignparse.consensus.html#alignparse.consensus.add_mut_info_cols) to add information about whether the CCSs have indels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.716641Z",
     "iopub.status.busy": "2024-05-23T18:11:37.716421Z",
     "iopub.status.idle": "2024-05-23T18:11:37.729476Z",
     "shell.execute_reply": "2024-05-23T18:11:37.728770Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.716619Z"
    }
   },
   "outputs": [],
   "source": [
    "aligned_df = alignparse.consensus.add_mut_info_cols(\n",
    "    aligned_df, mutation_col=\"gene_mutations\", n_indel_col=\"n_indels\"\n",
    ")\n",
    "\n",
    "aligned_df = aligned_df.assign(has_indel=lambda x: x[\"n_indels\"] > 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are the numbers with and without indels among the retained CCSs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.730691Z",
     "iopub.status.busy": "2024-05-23T18:11:37.730441Z",
     "iopub.status.idle": "2024-05-23T18:11:37.745522Z",
     "shell.execute_reply": "2024-05-23T18:11:37.744859Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.730664Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>has_indel</th>\n",
       "      <th>number_CCSs</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>False</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>True</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lib-2</td>\n",
       "      <td>False</td>\n",
       "      <td>65</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>lib-2</td>\n",
       "      <td>True</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  library  has_indel  number_CCSs\n",
       "0   lib-1      False           76\n",
       "1   lib-1       True           14\n",
       "2   lib-2      False           65\n",
       "3   lib-2       True            8"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(\n",
    "    aligned_df.query(\"retained\")\n",
    "    .groupby([\"library\", \"has_indel\"])\n",
    "    .size()\n",
    "    .rename(\"number_CCSs\")\n",
    "    .reset_index()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can compute the empirical accuracy.\n",
    "First, among all retained CCSs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.746925Z",
     "iopub.status.busy": "2024-05-23T18:11:37.746665Z",
     "iopub.status.idle": "2024-05-23T18:11:37.805058Z",
     "shell.execute_reply": "2024-05-23T18:11:37.804203Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.746899Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>accuracy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>0.864817</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lib-2</td>\n",
       "      <td>0.786710</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  library  accuracy\n",
       "0   lib-1  0.864817\n",
       "1   lib-2  0.786710"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alignparse.consensus.empirical_accuracy(\n",
    "    aligned_df.query(\"retained\"), mutation_col=\"gene_mutations\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And excluding CCSs with indels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.806962Z",
     "iopub.status.busy": "2024-05-23T18:11:37.806552Z",
     "iopub.status.idle": "2024-05-23T18:11:37.851233Z",
     "shell.execute_reply": "2024-05-23T18:11:37.850716Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.806931Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>accuracy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>0.948033</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lib-2</td>\n",
       "      <td>0.928539</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  library  accuracy\n",
       "0   lib-1  0.948033\n",
       "1   lib-2  0.928539"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alignparse.consensus.empirical_accuracy(\n",
    "    aligned_df.query(\"retained & not has_indel\"), mutation_col=\"gene_mutations\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen above, the empirical accuracy is quite a bit higher when excluding sequences with indels, which is what will happen in practice below.\n",
    "However, the empirical accuracy is still much lower than the PacBio `ccs` estimated accuracy, since above we only retained CCSs with accuracy >99.99%.\n",
    "This indicates that either the `ccs` estimated accuracies are not actually accurate, or other experimental factors also contribute to decrease the empirical accuracy.\n",
    "You can play around with the filter on the `ccs`-estimated accuracies to see how they affect the empirical accuracy--but in general, we find on real (larger) datasets that beyond a point, increasing the filter on the `ccs`-estimated accuracies no longer further enhances the empirical accuracy.\n",
    "\n",
    "Finally, we'd like to actually build a consensus sequence for each barcodes.\n",
    "There are lots of existing programs with complex error models that use Q-values to build consensus sequences--but we instead use the simple approach implemented in [alignparse.consensus.simple_mutconsensus](https://jbloomlab.github.io/alignparse/alignparse.consensus.html#alignparse.consensus.simple_mutconsensus).\n",
    "The documentation for that function explains the method in detail, but basically it works like this:\n",
    " 1. When there is just one CCS per barcode, the consensus is just that sequence.\n",
    " 2. When there are multiple CCSs per barcode, they are used to build a consensus--however, the entire barcode is discarded if there are many differences between CCSs with the barcode, or high-frequency non-consensus mutations. The reason that barcodes are discarded in such cases as many differences between CCSs or high-frequency non-consensus mutations suggest errors such as barcode collisions or PCR strand exchange.\n",
    " \n",
    "The advantage of this simple method is that it tries to throw away barcodes for which there is likely to be some source of experimental error.\n",
    "\n",
    "Build the consensus sequences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.852673Z",
     "iopub.status.busy": "2024-05-23T18:11:37.852375Z",
     "iopub.status.idle": "2024-05-23T18:11:37.868972Z",
     "shell.execute_reply": "2024-05-23T18:11:37.868212Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.852657Z"
    }
   },
   "outputs": [],
   "source": [
    "consensus, dropped = alignparse.consensus.simple_mutconsensus(\n",
    "    aligned_df.query(\"retained\"),\n",
    "    group_cols=(\"library\", \"barcode\"),\n",
    "    mutation_col=\"gene_mutations\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how we get back two data frames.\n",
    "The `consensus` data frame simply gives the consensus sequences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.870088Z",
     "iopub.status.busy": "2024-05-23T18:11:37.869798Z",
     "iopub.status.idle": "2024-05-23T18:11:37.877165Z",
     "shell.execute_reply": "2024-05-23T18:11:37.876328Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.870071Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>barcode</th>\n",
       "      <th>gene_mutations</th>\n",
       "      <th>variant_call_support</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>AAAGTCTGACCGAAAAAA</td>\n",
       "      <td>C154A G509C T510A C823G T824G G825T del763to763</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>AACACGTTATAGATGTAG</td>\n",
       "      <td>A52C T53C T54C C85A C87G T277G T278A C296A G297C</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>AGACCGGGCGGGGCCTAT</td>\n",
       "      <td>A526G G694A G715C A937G C939G</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>AGAGAGATATTAAAAAAA</td>\n",
       "      <td>C22A A23C G24C G34A C35G G36A G400C G402C A631...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>ATCTCCTCTCCAAGCCGT</td>\n",
       "      <td>G326C C327A</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  library             barcode  \\\n",
       "0   lib-1  AAAGTCTGACCGAAAAAA   \n",
       "1   lib-1  AACACGTTATAGATGTAG   \n",
       "2   lib-1  AGACCGGGCGGGGCCTAT   \n",
       "3   lib-1  AGAGAGATATTAAAAAAA   \n",
       "4   lib-1  ATCTCCTCTCCAAGCCGT   \n",
       "\n",
       "                                      gene_mutations  variant_call_support  \n",
       "0    C154A G509C T510A C823G T824G G825T del763to763                     1  \n",
       "1   A52C T53C T54C C85A C87G T277G T278A C296A G297C                     3  \n",
       "2                      A526G G694A G715C A937G C939G                     4  \n",
       "3  C22A A23C G24C G34A C35G G36A G400C G402C A631...                     1  \n",
       "4                                        G326C C327A                     1  "
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "consensus.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition to the mutations in the consensus, it also gives the `variant_call_support`, which is the number of CCSs supporting the consensus call.\n",
    "When the call support is one, then we expect the accuracy to be equal to the empirical ones we computed above (either with or without indels depending on whether we exclude consensus sequences with indels).\n",
    "When the variant call support is greater than one, the accuracy will be higher as there are more CCSs supporting the consensus call.\n",
    "\n",
    "It is often useful to get the mutations in `consensus` that are just substitutions, and also denote which sequences have indels.\n",
    "That can be done as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.878326Z",
     "iopub.status.busy": "2024-05-23T18:11:37.878068Z",
     "iopub.status.idle": "2024-05-23T18:11:37.889259Z",
     "shell.execute_reply": "2024-05-23T18:11:37.888548Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.878308Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>barcode</th>\n",
       "      <th>gene_mutations</th>\n",
       "      <th>variant_call_support</th>\n",
       "      <th>substitutions</th>\n",
       "      <th>number_of_indels</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>AAAGTCTGACCGAAAAAA</td>\n",
       "      <td>C154A G509C T510A C823G T824G G825T del763to763</td>\n",
       "      <td>1</td>\n",
       "      <td>C154A G509C T510A C823G T824G G825T</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>AACACGTTATAGATGTAG</td>\n",
       "      <td>A52C T53C T54C C85A C87G T277G T278A C296A G297C</td>\n",
       "      <td>3</td>\n",
       "      <td>A52C T53C T54C C85A C87G T277G T278A C296A G297C</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>AGACCGGGCGGGGCCTAT</td>\n",
       "      <td>A526G G694A G715C A937G C939G</td>\n",
       "      <td>4</td>\n",
       "      <td>A526G G694A G715C A937G C939G</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>AGAGAGATATTAAAAAAA</td>\n",
       "      <td>C22A A23C G24C G34A C35G G36A G400C G402C A631...</td>\n",
       "      <td>1</td>\n",
       "      <td>C22A A23C G24C G34A C35G G36A G400C G402C A631...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>ATCTCCTCTCCAAGCCGT</td>\n",
       "      <td>G326C C327A</td>\n",
       "      <td>1</td>\n",
       "      <td>G326C C327A</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  library             barcode  \\\n",
       "0   lib-1  AAAGTCTGACCGAAAAAA   \n",
       "1   lib-1  AACACGTTATAGATGTAG   \n",
       "2   lib-1  AGACCGGGCGGGGCCTAT   \n",
       "3   lib-1  AGAGAGATATTAAAAAAA   \n",
       "4   lib-1  ATCTCCTCTCCAAGCCGT   \n",
       "\n",
       "                                      gene_mutations  variant_call_support  \\\n",
       "0    C154A G509C T510A C823G T824G G825T del763to763                     1   \n",
       "1   A52C T53C T54C C85A C87G T277G T278A C296A G297C                     3   \n",
       "2                      A526G G694A G715C A937G C939G                     4   \n",
       "3  C22A A23C G24C G34A C35G G36A G400C G402C A631...                     1   \n",
       "4                                        G326C C327A                     1   \n",
       "\n",
       "                                       substitutions  number_of_indels  \n",
       "0                C154A G509C T510A C823G T824G G825T                 1  \n",
       "1   A52C T53C T54C C85A C87G T277G T278A C296A G297C                 0  \n",
       "2                      A526G G694A G715C A937G C939G                 0  \n",
       "3  C22A A23C G24C G34A C35G G36A G400C G402C A631...                 0  \n",
       "4                                        G326C C327A                 0  "
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "consensus = alignparse.consensus.add_mut_info_cols(\n",
    "    consensus,\n",
    "    mutation_col=\"gene_mutations\",\n",
    "    sub_str_col=\"substitutions\",\n",
    "    n_indel_col=\"number_of_indels\",\n",
    "    overwrite_cols=True,\n",
    ")\n",
    "\n",
    "consensus.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you filter the data frame above for just those barcodes with no indels, you can then pass the data frame directly into a [dms_variants.codonvarianttable.CodonVariantTable](https://jbloomlab.github.io/dms_variants/dms_variants.codonvarianttable.html#dms_variants.codonvarianttable.CodonVariantTable) for further analysis.\n",
    "\n",
    "You can also look at what happened to the barcodes for which we could **not** build consensus sequences by looking at the `dropped` data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.890523Z",
     "iopub.status.busy": "2024-05-23T18:11:37.890255Z",
     "iopub.status.idle": "2024-05-23T18:11:37.897393Z",
     "shell.execute_reply": "2024-05-23T18:11:37.896717Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.890505Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>barcode</th>\n",
       "      <th>drop_reason</th>\n",
       "      <th>nseqs</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>CTATACCCAAATTAATAA</td>\n",
       "      <td>subs diff too large</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lib-1</td>\n",
       "      <td>CTGATTTGGCTTTATTTT</td>\n",
       "      <td>subs diff too large</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lib-2</td>\n",
       "      <td>CTGATATACGTACGCAAC</td>\n",
       "      <td>subs diff too large</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>lib-2</td>\n",
       "      <td>TACCCTGCCTCGCCGAAC</td>\n",
       "      <td>subs diff too large</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  library             barcode          drop_reason  nseqs\n",
       "0   lib-1  CTATACCCAAATTAATAA  subs diff too large      2\n",
       "1   lib-1  CTGATTTGGCTTTATTTT  subs diff too large      2\n",
       "2   lib-2  CTGATATACGTACGCAAC  subs diff too large      3\n",
       "3   lib-2  TACCCTGCCTCGCCGAAC  subs diff too large      2"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dropped"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or to summarize the drop reasons:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-23T18:11:37.898482Z",
     "iopub.status.busy": "2024-05-23T18:11:37.898192Z",
     "iopub.status.idle": "2024-05-23T18:11:37.906206Z",
     "shell.execute_reply": "2024-05-23T18:11:37.905304Z",
     "shell.execute_reply.started": "2024-05-23T18:11:37.898465Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>drop_reason</th>\n",
       "      <th>number_barcodes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>subs diff too large</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           drop_reason  number_barcodes\n",
       "0  subs diff too large                4"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(dropped.groupby(\"drop_reason\").size().rename(\"number_barcodes\").reset_index())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that only four barcodes (two per library) were dropped, and that in every case the reason was `subs diff too large`: the CCSs sharing that barcode differed from one another by too many substitutions."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}