{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Simulate experiment\n",
    "We will simulate data from a deep mutational scanning experiment using the RBD antibody mix in order to provide hypothetical data on which to fit a `Polyclonal` model.\n",
    "\n",
    "## Simulate three epitopes\n",
    "\n",
    "We first define a `Polyclonal` object that represents the antibody mix we want to simulate.\n",
    "This mix is again based on the three RBD antibodies targeting class 1, 2, and 3 epitopes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:50:45.271294Z",
     "iopub.status.busy": "2023-02-20T14:50:45.271155Z",
     "iopub.status.idle": "2023-02-20T14:50:45.320766Z",
     "shell.execute_reply": "2023-02-20T14:50:45.320128Z",
     "shell.execute_reply.started": "2023-02-20T14:50:45.271280Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epitopes: ('class 1', 'class 2', 'class 3')\n",
      "Number of mutations: 1932\n",
      "Number of sites: 173\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "import polyclonal\n",
    "\n",
    "# read data needed to initialize `Polyclonal` object\n",
    "activity_wt_df = pd.read_csv(\"RBD_activity_wt_df.csv\")\n",
    "mut_escape_df = pd.read_csv(\"RBD_mut_escape_df.csv\")\n",
    "\n",
    "# create `Polyclonal` object with actual antibody mix\n",
    "model_3ep = polyclonal.Polyclonal(\n",
    "    activity_wt_df=activity_wt_df, mut_escape_df=mut_escape_df\n",
    ")\n",
    "\n",
    "print(f\"Epitopes: {model_3ep.epitopes}\")\n",
    "print(f\"Number of mutations: {len(model_3ep.mutations)}\")\n",
    "print(f\"Number of sites: {len(model_3ep.sites)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we simulate libraries of variants with an average of 1, 2, 3, or 4 mutations per gene.\n",
    "These libraries only contain mutations for which we defined mutation escape above.\n",
    "We a [dms_variants](https://jbloomlab.github.io/dms_variants/) `CodonVariantTable` for the RBD with these mutations.\n",
    "Note that because the `CodonVariantTable` requires sequential 1, 2, ... numbering, we have to do some shifting of sites back and forth with an offset of 331 since that is where the RBD starts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:50:45.321789Z",
     "iopub.status.busy": "2023-02-20T14:50:45.321525Z",
     "iopub.status.idle": "2023-02-20T14:51:02.517726Z",
     "shell.execute_reply": "2023-02-20T14:51:02.517131Z",
     "shell.execute_reply.started": "2023-02-20T14:50:45.321772Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 1932 allowed amino-acid mutations.\n"
     ]
    }
   ],
   "source": [
    "import Bio.SeqIO\n",
    "\n",
    "import dms_variants.simulate\n",
    "\n",
    "import polyclonal.utils\n",
    "\n",
    "# read coding sequence, then slice to just region of interest\n",
    "# noting that RBD starts at residue 331\n",
    "geneseq = str(Bio.SeqIO.read(\"RBD_seq.fasta\", \"fasta\").seq)\n",
    "\n",
    "allowed_aa_muts = model_3ep.mut_escape_df[\"mutation\"].unique()\n",
    "print(f\"There are {len(allowed_aa_muts)} allowed amino-acid mutations.\")\n",
    "\n",
    "variants = dms_variants.simulate.simulate_CodonVariantTable(\n",
    "    geneseq=geneseq,\n",
    "    bclen=16,\n",
    "    library_specs={\n",
    "        f\"avg{m}muts\": {\"avgmuts\": m, \"nvariants\": 40000} for m in [1, 2, 3, 4]\n",
    "    },\n",
    "    allowed_aa_muts=[polyclonal.utils.shift_mut_site(m, -330) for m in allowed_aa_muts],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now look at the distribution of the number of amino-acid mutations per-variant in the library.\n",
    "The mutation rate is relatively high as we pre-screened for tolerated mutations and need multiple mutants to decompose the sera mix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:51:02.518616Z",
     "iopub.status.busy": "2023-02-20T14:51:02.518458Z",
     "iopub.status.idle": "2023-02-20T14:51:03.049318Z",
     "shell.execute_reply": "2023-02-20T14:51:03.048507Z",
     "shell.execute_reply.started": "2023-02-20T14:51:02.518602Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# NBVAL_IGNORE_OUTPUT\n",
    "\n",
    "p = variants.plotNumMutsHistogram(\n",
    "    mut_type=\"aa\", samples=None, libraries=variants.libraries\n",
    ")\n",
    "_ = p.draw()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also look at the mutation rate across the gene.\n",
    "Note that this is sequential numbering of the sequence.\n",
    "Also, the per-site mutation frequency is uneven because we only include tolerated mutations, and there are different number of tolerated mutations per site:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:51:03.050661Z",
     "iopub.status.busy": "2023-02-20T14:51:03.050362Z",
     "iopub.status.idle": "2023-02-20T14:51:04.375390Z",
     "shell.execute_reply": "2023-02-20T14:51:04.374643Z",
     "shell.execute_reply.started": "2023-02-20T14:51:03.050643Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# NBVAL_IGNORE_OUTPUT\n",
    "\n",
    "p = variants.plotMutFreqs(\n",
    "    variant_type=\"all\", mut_type=\"aa\", samples=None, libraries=variants.libraries\n",
    ")\n",
    "_ = p.draw()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we get the data frame with the amino-acid substitutions for each variant, re-adding site offset to get back to RBD numbering:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:51:04.376459Z",
     "iopub.status.busy": "2023-02-20T14:51:04.376197Z",
     "iopub.status.idle": "2023-02-20T14:51:04.916626Z",
     "shell.execute_reply": "2023-02-20T14:51:04.915960Z",
     "shell.execute_reply.started": "2023-02-20T14:51:04.376443Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>barcode</th>\n",
       "      <th>aa_substitutions</th>\n",
       "      <th>n_aa_substitutions</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAAAATTTACGCG</td>\n",
       "      <td>F486P</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAAATCCTCAATT</td>\n",
       "      <td>T385A</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAACTATGCACCT</td>\n",
       "      <td>Q498N</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAAGACGCTGTGC</td>\n",
       "      <td>P337S</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAATAGTTCTGAT</td>\n",
       "      <td>S371F R408V</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>159995</th>\n",
       "      <td>avg4muts</td>\n",
       "      <td>TTTTTTGCGTTTATCT</td>\n",
       "      <td>V367T</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>159996</th>\n",
       "      <td>avg4muts</td>\n",
       "      <td>TTTTTTTCCGCTCTAA</td>\n",
       "      <td>Y369R C391T N440K G447C K462G</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>159997</th>\n",
       "      <td>avg4muts</td>\n",
       "      <td>TTTTTTTGTAAGCGAG</td>\n",
       "      <td>N334L A348S K386S S443G Q493N V503S</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>159998</th>\n",
       "      <td>avg4muts</td>\n",
       "      <td>TTTTTTTGTTTTCAGC</td>\n",
       "      <td>L335A P337S R357E Y396H E406W G485R T500E</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>159999</th>\n",
       "      <td>avg4muts</td>\n",
       "      <td>TTTTTTTTGCGGCCGT</td>\n",
       "      <td>K444A P527S</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>160000 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         library           barcode                           aa_substitutions  \\\n",
       "0       avg1muts  AAAAAAAATTTACGCG                                      F486P   \n",
       "1       avg1muts  AAAAAAATCCTCAATT                                      T385A   \n",
       "2       avg1muts  AAAAAACTATGCACCT                                      Q498N   \n",
       "3       avg1muts  AAAAAAGACGCTGTGC                                      P337S   \n",
       "4       avg1muts  AAAAAATAGTTCTGAT                                S371F R408V   \n",
       "...          ...               ...                                        ...   \n",
       "159995  avg4muts  TTTTTTGCGTTTATCT                                      V367T   \n",
       "159996  avg4muts  TTTTTTTCCGCTCTAA              Y369R C391T N440K G447C K462G   \n",
       "159997  avg4muts  TTTTTTTGTAAGCGAG        N334L A348S K386S S443G Q493N V503S   \n",
       "159998  avg4muts  TTTTTTTGTTTTCAGC  L335A P337S R357E Y396H E406W G485R T500E   \n",
       "159999  avg4muts  TTTTTTTTGCGGCCGT                                K444A P527S   \n",
       "\n",
       "        n_aa_substitutions  \n",
       "0                        1  \n",
       "1                        1  \n",
       "2                        1  \n",
       "3                        1  \n",
       "4                        2  \n",
       "...                    ...  \n",
       "159995                   1  \n",
       "159996                   5  \n",
       "159997                   6  \n",
       "159998                   7  \n",
       "159999                   2  \n",
       "\n",
       "[160000 rows x 4 columns]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variants_df = variants.barcode_variant_df[\n",
    "    [\"library\", \"barcode\", \"aa_substitutions\", \"n_aa_substitutions\"]\n",
    "].assign(\n",
    "    aa_substitutions=lambda x: x[\"aa_substitutions\"].apply(\n",
    "        polyclonal.utils.shift_mut_site, shift=330\n",
    "    )\n",
    ")\n",
    "\n",
    "variants_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we use our `Polyclonal` object to compute the predicted probability of escape ($p_v\\left(c\\right)$) at multiple serum concentrations given the mutation escape values $\\beta_{m,e}$ and activities $a_{\\rm{wt},e}$ stored in the `Polyclonal` object.\n",
    "Since we will use these as the \"true\" values in our our simulation, we then rename the column with the predicted probabilities of escape to just be the probabilities of escape:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:51:04.917631Z",
     "iopub.status.busy": "2023-02-20T14:51:04.917402Z",
     "iopub.status.idle": "2023-02-20T14:51:10.388961Z",
     "shell.execute_reply": "2023-02-20T14:51:10.388330Z",
     "shell.execute_reply.started": "2023-02-20T14:51:04.917616Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>barcode</th>\n",
       "      <th>aa_substitutions</th>\n",
       "      <th>n_aa_substitutions</th>\n",
       "      <th>concentration</th>\n",
       "      <th>prob_escape</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAATGTTCTATCC</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.048594</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACAATCCGGACT</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.048594</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACGCGGTCACTT</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.048594</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACTTGGCTAGCT</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.048594</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAGCAAGGCCCAG</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.048594</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959995</th>\n",
       "      <td>avg2muts</td>\n",
       "      <td>ATTCACCACCAGGTAG</td>\n",
       "      <td>Y508W</td>\n",
       "      <td>1</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959996</th>\n",
       "      <td>avg2muts</td>\n",
       "      <td>TAGAATGTATTATTCT</td>\n",
       "      <td>Y508W</td>\n",
       "      <td>1</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959997</th>\n",
       "      <td>avg3muts</td>\n",
       "      <td>CTTAAAATAGCTGGTC</td>\n",
       "      <td>Y508W</td>\n",
       "      <td>1</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959998</th>\n",
       "      <td>avg4muts</td>\n",
       "      <td>GGTCAATTATGTCGGG</td>\n",
       "      <td>Y508W</td>\n",
       "      <td>1</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959999</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>GGAACGACAGTGATCG</td>\n",
       "      <td>Y508W T531H</td>\n",
       "      <td>2</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000006</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>960000 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         library           barcode aa_substitutions  n_aa_substitutions  \\\n",
       "0       avg1muts  AAAAAATGTTCTATCC                                    0   \n",
       "1       avg1muts  AAAAACAATCCGGACT                                    0   \n",
       "2       avg1muts  AAAAACGCGGTCACTT                                    0   \n",
       "3       avg1muts  AAAAACTTGGCTAGCT                                    0   \n",
       "4       avg1muts  AAAAAGCAAGGCCCAG                                    0   \n",
       "...          ...               ...              ...                 ...   \n",
       "959995  avg2muts  ATTCACCACCAGGTAG            Y508W                   1   \n",
       "959996  avg2muts  TAGAATGTATTATTCT            Y508W                   1   \n",
       "959997  avg3muts  CTTAAAATAGCTGGTC            Y508W                   1   \n",
       "959998  avg4muts  GGTCAATTATGTCGGG            Y508W                   1   \n",
       "959999  avg1muts  GGAACGACAGTGATCG      Y508W T531H                   2   \n",
       "\n",
       "        concentration  prob_escape  \n",
       "0               0.125     0.048594  \n",
       "1               0.125     0.048594  \n",
       "2               0.125     0.048594  \n",
       "3               0.125     0.048594  \n",
       "4               0.125     0.048594  \n",
       "...               ...          ...  \n",
       "959995          4.000     0.000006  \n",
       "959996          4.000     0.000006  \n",
       "959997          4.000     0.000006  \n",
       "959998          4.000     0.000006  \n",
       "959999          4.000     0.000006  \n",
       "\n",
       "[960000 rows x 6 columns]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variants_escape = model_3ep.prob_escape(\n",
    "    variants_df=variants_df, concentrations=[0.125, 0.25, 0.5, 1, 2, 4]\n",
    ")\n",
    "\n",
    "variants_escape = variants_escape.rename(\n",
    "    columns={\"predicted_prob_escape\": \"prob_escape\"}\n",
    ")\n",
    "\n",
    "variants_escape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot some features of the variants, namely the distribution of probability of escape for each number of mutations.\n",
    "As expected, variants with many mutations are more likely to escape the serum:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:51:10.389993Z",
     "iopub.status.busy": "2023-02-20T14:51:10.389721Z",
     "iopub.status.idle": "2023-02-20T14:51:25.200077Z",
     "shell.execute_reply": "2023-02-20T14:51:25.199311Z",
     "shell.execute_reply.started": "2023-02-20T14:51:10.389976Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# NBVAL_IGNORE_OUTPUT\n",
    "\n",
    "from plotnine import *\n",
    "\n",
    "p = (\n",
    "    ggplot(\n",
    "        variants_escape.assign(\n",
    "            n_aa_substitutions=lambda x: x[\"n_aa_substitutions\"].map(\n",
    "                lambda n: str(n) if n <= 6 else \">6\"\n",
    "            ),\n",
    "            concentration=lambda x: pd.Categorical(x[\"concentration\"]),\n",
    "        )\n",
    "    )\n",
    "    + aes(\"n_aa_substitutions\", \"prob_escape\", fill=\"concentration\")\n",
    "    + geom_boxplot(outlier_size=0.5, outlier_alpha=0.1)\n",
    "    + facet_wrap(\"~ library\", ncol=1)\n",
    "    + theme_classic()\n",
    "    + theme(figure_size=(9, 7))\n",
    ")\n",
    "\n",
    "_ = p.draw()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also, just compute the overall average probability of escape for each library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:51:25.201411Z",
     "iopub.status.busy": "2023-02-20T14:51:25.201225Z",
     "iopub.status.idle": "2023-02-20T14:51:25.254851Z",
     "shell.execute_reply": "2023-02-20T14:51:25.254192Z",
     "shell.execute_reply.started": "2023-02-20T14:51:25.201396Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>prob_escape</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>library</th>\n",
       "      <th>concentration</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"6\" valign=\"top\">avg1muts</th>\n",
       "      <th>0.125</th>\n",
       "      <td>0.086326</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.250</th>\n",
       "      <td>0.033071</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.500</th>\n",
       "      <td>0.012991</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1.000</th>\n",
       "      <td>0.005777</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2.000</th>\n",
       "      <td>0.003006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4.000</th>\n",
       "      <td>0.001809</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"6\" valign=\"top\">avg2muts</th>\n",
       "      <th>0.125</th>\n",
       "      <td>0.125877</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.250</th>\n",
       "      <td>0.060015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.500</th>\n",
       "      <td>0.030075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1.000</th>\n",
       "      <td>0.016740</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2.000</th>\n",
       "      <td>0.010386</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4.000</th>\n",
       "      <td>0.007009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"6\" valign=\"top\">avg3muts</th>\n",
       "      <th>0.125</th>\n",
       "      <td>0.165698</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.250</th>\n",
       "      <td>0.090470</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.500</th>\n",
       "      <td>0.052089</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1.000</th>\n",
       "      <td>0.032763</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2.000</th>\n",
       "      <td>0.022442</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4.000</th>\n",
       "      <td>0.016411</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"6\" valign=\"top\">avg4muts</th>\n",
       "      <th>0.125</th>\n",
       "      <td>0.204486</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.250</th>\n",
       "      <td>0.122522</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.500</th>\n",
       "      <td>0.077355</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1.000</th>\n",
       "      <td>0.052786</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2.000</th>\n",
       "      <td>0.038664</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4.000</th>\n",
       "      <td>0.029815</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        prob_escape\n",
       "library  concentration             \n",
       "avg1muts 0.125             0.086326\n",
       "         0.250             0.033071\n",
       "         0.500             0.012991\n",
       "         1.000             0.005777\n",
       "         2.000             0.003006\n",
       "         4.000             0.001809\n",
       "avg2muts 0.125             0.125877\n",
       "         0.250             0.060015\n",
       "         0.500             0.030075\n",
       "         1.000             0.016740\n",
       "         2.000             0.010386\n",
       "         4.000             0.007009\n",
       "avg3muts 0.125             0.165698\n",
       "         0.250             0.090470\n",
       "         0.500             0.052089\n",
       "         1.000             0.032763\n",
       "         2.000             0.022442\n",
       "         4.000             0.016411\n",
       "avg4muts 0.125             0.204486\n",
       "         0.250             0.122522\n",
       "         0.500             0.077355\n",
       "         1.000             0.052786\n",
       "         2.000             0.038664\n",
       "         4.000             0.029815"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(\n",
    "    variants_escape.groupby([\"library\", \"concentration\"]).aggregate(\n",
    "        {\"prob_escape\": \"mean\"}\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also want to acknowledge the fact that a real experiment will have some noise in the data. This noise is likely to come in two forms:\n",
    "  \n",
    "  1. Inaccuracies in the measurements of the probabilities of escape, $p_v\\left(c\\right)$.\n",
    "  2. Occassional mis-assignment of which mutations are in a variant due to sequencing errors.\n",
    "  \n",
    "We therefore will make a \"noisy\" version of `variants_df` where we have incorporated both of these sources of noise by adding Gaussian measurement error to the escape probabilities (but then truncating them to be between 0 and 1), and by adding or subtracting a mutation to occassional variants to represent sequencing errors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:51:25.255663Z",
     "iopub.status.busy": "2023-02-20T14:51:25.255491Z",
     "iopub.status.idle": "2023-02-20T14:51:25.265452Z",
     "shell.execute_reply": "2023-02-20T14:51:25.264832Z",
     "shell.execute_reply.started": "2023-02-20T14:51:25.255647Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>barcode</th>\n",
       "      <th>aa_substitutions</th>\n",
       "      <th>n_aa_substitutions</th>\n",
       "      <th>concentration</th>\n",
       "      <th>prob_escape</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAATGTTCTATCC</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.048594</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACAATCCGGACT</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.048594</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACGCGGTCACTT</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.048594</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACTTGGCTAGCT</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.048594</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAGCAAGGCCCAG</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.048594</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959995</th>\n",
       "      <td>avg2muts</td>\n",
       "      <td>ATTCACCACCAGGTAG</td>\n",
       "      <td>Y508W</td>\n",
       "      <td>1</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959996</th>\n",
       "      <td>avg2muts</td>\n",
       "      <td>TAGAATGTATTATTCT</td>\n",
       "      <td>Y508W</td>\n",
       "      <td>1</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959997</th>\n",
       "      <td>avg3muts</td>\n",
       "      <td>CTTAAAATAGCTGGTC</td>\n",
       "      <td>Y508W</td>\n",
       "      <td>1</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959998</th>\n",
       "      <td>avg4muts</td>\n",
       "      <td>GGTCAATTATGTCGGG</td>\n",
       "      <td>Y508W</td>\n",
       "      <td>1</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959999</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>GGAACGACAGTGATCG</td>\n",
       "      <td>Y508W T531H</td>\n",
       "      <td>2</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000006</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>960000 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         library           barcode aa_substitutions  n_aa_substitutions  \\\n",
       "0       avg1muts  AAAAAATGTTCTATCC                                    0   \n",
       "1       avg1muts  AAAAACAATCCGGACT                                    0   \n",
       "2       avg1muts  AAAAACGCGGTCACTT                                    0   \n",
       "3       avg1muts  AAAAACTTGGCTAGCT                                    0   \n",
       "4       avg1muts  AAAAAGCAAGGCCCAG                                    0   \n",
       "...          ...               ...              ...                 ...   \n",
       "959995  avg2muts  ATTCACCACCAGGTAG            Y508W                   1   \n",
       "959996  avg2muts  TAGAATGTATTATTCT            Y508W                   1   \n",
       "959997  avg3muts  CTTAAAATAGCTGGTC            Y508W                   1   \n",
       "959998  avg4muts  GGTCAATTATGTCGGG            Y508W                   1   \n",
       "959999  avg1muts  GGAACGACAGTGATCG      Y508W T531H                   2   \n",
       "\n",
       "        concentration  prob_escape  \n",
       "0               0.125     0.048594  \n",
       "1               0.125     0.048594  \n",
       "2               0.125     0.048594  \n",
       "3               0.125     0.048594  \n",
       "4               0.125     0.048594  \n",
       "...               ...          ...  \n",
       "959995          4.000     0.000006  \n",
       "959996          4.000     0.000006  \n",
       "959997          4.000     0.000006  \n",
       "959998          4.000     0.000006  \n",
       "959999          4.000     0.000006  \n",
       "\n",
       "[960000 rows x 6 columns]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variants_escape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:51:25.266524Z",
     "iopub.status.busy": "2023-02-20T14:51:25.266259Z",
     "iopub.status.idle": "2023-02-20T14:51:27.418744Z",
     "shell.execute_reply": "2023-02-20T14:51:27.418034Z",
     "shell.execute_reply.started": "2023-02-20T14:51:25.266505Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>barcode</th>\n",
       "      <th>concentration</th>\n",
       "      <th>prob_escape</th>\n",
       "      <th>aa_substitutions</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAATGTTCTATCC</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.097324</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACAATCCGGACT</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.030241</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACGCGGTCACTT</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.032749</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACTTGGCTAGCT</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.016405</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAGCAAGGCCCAG</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.074556</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959995</th>\n",
       "      <td>avg2muts</td>\n",
       "      <td>ATTCACCACCAGGTAG</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.074859</td>\n",
       "      <td>Y508W</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959996</th>\n",
       "      <td>avg2muts</td>\n",
       "      <td>TAGAATGTATTATTCT</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>Y508W</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959997</th>\n",
       "      <td>avg3muts</td>\n",
       "      <td>CTTAAAATAGCTGGTC</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>Y508W</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959998</th>\n",
       "      <td>avg4muts</td>\n",
       "      <td>GGTCAATTATGTCGGG</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>Y508W</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959999</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>GGAACGACAGTGATCG</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>Y508W T531H</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>960000 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         library           barcode  concentration  prob_escape  \\\n",
       "0       avg1muts  AAAAAATGTTCTATCC          0.125     0.097324   \n",
       "1       avg1muts  AAAAACAATCCGGACT          0.125     0.030241   \n",
       "2       avg1muts  AAAAACGCGGTCACTT          0.125     0.032749   \n",
       "3       avg1muts  AAAAACTTGGCTAGCT          0.125     0.016405   \n",
       "4       avg1muts  AAAAAGCAAGGCCCAG          0.125     0.074556   \n",
       "...          ...               ...            ...          ...   \n",
       "959995  avg2muts  ATTCACCACCAGGTAG          4.000     0.074859   \n",
       "959996  avg2muts  TAGAATGTATTATTCT          4.000     0.000000   \n",
       "959997  avg3muts  CTTAAAATAGCTGGTC          4.000     0.000000   \n",
       "959998  avg4muts  GGTCAATTATGTCGGG          4.000     0.000000   \n",
       "959999  avg1muts  GGAACGACAGTGATCG          4.000     0.000000   \n",
       "\n",
       "       aa_substitutions  \n",
       "0                        \n",
       "1                        \n",
       "2                        \n",
       "3                        \n",
       "4                        \n",
       "...                 ...  \n",
       "959995            Y508W  \n",
       "959996            Y508W  \n",
       "959997            Y508W  \n",
       "959998            Y508W  \n",
       "959999      Y508W T531H  \n",
       "\n",
       "[960000 rows x 5 columns]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy\n",
    "\n",
    "numpy.random.seed(1)  # seed for reproducible output\n",
    "\n",
    "\n",
    "def add_subtract_mutation(subs):\n",
    "    \"\"\"Sometimes add or subtract a mutation.\"\"\"\n",
    "    subs = subs.split()\n",
    "    sub_sites = [int(sub[1:-1]) for sub in subs]\n",
    "    rand = numpy.random.random()\n",
    "    if rand < 0.01:  # add mutation with 1% probability\n",
    "        mut = numpy.random.choice(model_3ep.mutations)\n",
    "        while int(mut[1:-1]) in sub_sites:\n",
    "            mut = numpy.random.choice(model_3ep.mutations)\n",
    "        subs.append(mut)\n",
    "    elif rand > 0.99 and subs:  # subtract mutation with 1% probability\n",
    "        subs.pop(numpy.random.randint(len(subs)))\n",
    "    return \" \".join(subs)\n",
    "\n",
    "\n",
    "# only keep needed columns in variants_escape\n",
    "variants_escape = variants_escape.drop(columns=\"n_aa_substitutions\")\n",
    "\n",
    "# simulate noisy variants\n",
    "noisy_variants_escape = (\n",
    "    variants_escape.assign(\n",
    "        # add Gaussian noise with standard deviation of 0.03\n",
    "        prob_escape=lambda x: (\n",
    "            x[\"prob_escape\"] + numpy.random.normal(scale=0.03, size=len(x))\n",
    "        ).clip(lower=0, upper=1),\n",
    "    )\n",
    "    # add rare sequencing errors to variants\n",
    "    .drop(columns=\"aa_substitutions\").merge(\n",
    "        variants_df[[\"library\", \"barcode\", \"aa_substitutions\"]].assign(\n",
    "            aa_substitutions=lambda x: x[\"aa_substitutions\"].map(add_subtract_mutation)\n",
    "        ),\n",
    "        how=\"left\",\n",
    "        on=[\"library\", \"barcode\"],\n",
    "        validate=\"many_to_one\",\n",
    "    )\n",
    ")\n",
    "\n",
    "noisy_variants_escape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For both the exact and noisy data frames, we will also compute the IC90 for each variant based on the simulated antibody mix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:51:27.419813Z",
     "iopub.status.busy": "2023-02-20T14:51:27.419542Z",
     "iopub.status.idle": "2023-02-20T14:52:22.414051Z",
     "shell.execute_reply": "2023-02-20T14:52:22.413140Z",
     "shell.execute_reply.started": "2023-02-20T14:51:27.419796Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>library</th>\n",
       "      <th>barcode</th>\n",
       "      <th>concentration</th>\n",
       "      <th>prob_escape</th>\n",
       "      <th>aa_substitutions</th>\n",
       "      <th>IC90</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAATGTTCTATCC</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.097324</td>\n",
       "      <td></td>\n",
       "      <td>0.082119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACAATCCGGACT</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.030241</td>\n",
       "      <td></td>\n",
       "      <td>0.082119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACGCGGTCACTT</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.032749</td>\n",
       "      <td></td>\n",
       "      <td>0.082119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAACTTGGCTAGCT</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.016405</td>\n",
       "      <td></td>\n",
       "      <td>0.082119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>AAAAAGCAAGGCCCAG</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.074556</td>\n",
       "      <td></td>\n",
       "      <td>0.082119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959995</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>GGAACGACAGTGATCG</td>\n",
       "      <td>0.250</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>Y508W T531H</td>\n",
       "      <td>0.082119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959996</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>GGAACGACAGTGATCG</td>\n",
       "      <td>0.500</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>Y508W T531H</td>\n",
       "      <td>0.082119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959997</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>GGAACGACAGTGATCG</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>Y508W T531H</td>\n",
       "      <td>0.082119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959998</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>GGAACGACAGTGATCG</td>\n",
       "      <td>2.000</td>\n",
       "      <td>0.021968</td>\n",
       "      <td>Y508W T531H</td>\n",
       "      <td>0.082119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959999</th>\n",
       "      <td>avg1muts</td>\n",
       "      <td>GGAACGACAGTGATCG</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>Y508W T531H</td>\n",
       "      <td>0.082119</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>960000 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         library           barcode  concentration  prob_escape  \\\n",
       "0       avg1muts  AAAAAATGTTCTATCC          0.125     0.097324   \n",
       "1       avg1muts  AAAAACAATCCGGACT          0.125     0.030241   \n",
       "2       avg1muts  AAAAACGCGGTCACTT          0.125     0.032749   \n",
       "3       avg1muts  AAAAACTTGGCTAGCT          0.125     0.016405   \n",
       "4       avg1muts  AAAAAGCAAGGCCCAG          0.125     0.074556   \n",
       "...          ...               ...            ...          ...   \n",
       "959995  avg1muts  GGAACGACAGTGATCG          0.250     0.000000   \n",
       "959996  avg1muts  GGAACGACAGTGATCG          0.500     0.000000   \n",
       "959997  avg1muts  GGAACGACAGTGATCG          1.000     0.000000   \n",
       "959998  avg1muts  GGAACGACAGTGATCG          2.000     0.021968   \n",
       "959999  avg1muts  GGAACGACAGTGATCG          4.000     0.000000   \n",
       "\n",
       "       aa_substitutions      IC90  \n",
       "0                        0.082119  \n",
       "1                        0.082119  \n",
       "2                        0.082119  \n",
       "3                        0.082119  \n",
       "4                        0.082119  \n",
       "...                 ...       ...  \n",
       "959995      Y508W T531H  0.082119  \n",
       "959996      Y508W T531H  0.082119  \n",
       "959997      Y508W T531H  0.082119  \n",
       "959998      Y508W T531H  0.082119  \n",
       "959999      Y508W T531H  0.082119  \n",
       "\n",
       "[960000 rows x 6 columns]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variants_escape = model_3ep.icXX(variants_escape, x=0.9, col=\"IC90\")\n",
    "\n",
    "noisy_variants_escape = model_3ep.icXX(noisy_variants_escape, x=0.9, col=\"IC90\")\n",
    "\n",
    "noisy_variants_escape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will write these data frames giving the variants $v$ and their probabilities of escape $p_v\\left(c\\right)$ at each serum concentration $c$ to CSV files:\n",
    "\n",
    " - [RBD_variants_escape_exact.csv](RBD_variants_escape_exact.csv) for the measurements without noise\n",
    " - [RBD_variants_escape_noisy.csv](RBD_variants_escape_noisy.csv) for the measurements with realistic experimental noise\n",
    "\n",
    "The goal is to infer the properties of the escape mutations and the serum, $a_{\\rm{wt},e}$ and $\\beta_{m,e}$, from these data, which are what would be measured in a deep mutational scanning experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:52:22.415643Z",
     "iopub.status.busy": "2023-02-20T14:52:22.415330Z",
     "iopub.status.idle": "2023-02-20T14:52:30.872563Z",
     "shell.execute_reply": "2023-02-20T14:52:30.871615Z",
     "shell.execute_reply.started": "2023-02-20T14:52:22.415623Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "variants_escape.to_csv(\n",
    "    \"RBD_variants_escape_exact.csv\", index=False, float_format=\"%.4g\"\n",
    ")\n",
    "\n",
    "noisy_variants_escape.to_csv(\n",
    "    \"RBD_variants_escape_noisy.csv\", index=False, float_format=\"%.4g\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simulate library with 2 epitopes and non-one Hill coefficient and non-zero non-neutralizable fraction\n",
    "Create a two-epitope model with the Hill coefficent not equal to one, and the non-neutralizable fraction greater than zero.\n",
    "To do this, we just keep two epitopes from the above model and rename them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:52:30.874086Z",
     "iopub.status.busy": "2023-02-20T14:52:30.873709Z",
     "iopub.status.idle": "2023-02-20T14:52:30.925043Z",
     "shell.execute_reply": "2023-02-20T14:52:30.924382Z",
     "shell.execute_reply.started": "2023-02-20T14:52:30.874059Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>epitope</th>\n",
       "      <th>activity</th>\n",
       "      <th>hill_coefficient</th>\n",
       "      <th>non_neutralized_frac</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>class 2</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.2</td>\n",
       "      <td>0.01</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>class 3</td>\n",
       "      <td>2.4</td>\n",
       "      <td>1.6</td>\n",
       "      <td>0.05</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   epitope  activity  hill_coefficient  non_neutralized_frac\n",
       "0  class 2       3.2               1.2                  0.01\n",
       "1  class 3       2.4               1.6                  0.05"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_2ep = polyclonal.Polyclonal(\n",
    "    activity_wt_df=activity_wt_df.query(\"epitope != 'class 1'\"),\n",
    "    mut_escape_df=mut_escape_df.query(\"epitope != 'class 1'\"),\n",
    "    hill_coefficient_df=pd.DataFrame(\n",
    "        {\"epitope\": [\"class 2\", \"class 3\"], \"hill_coefficient\": [1.2, 1.6]}\n",
    "    ),\n",
    "    non_neutralized_frac_df=pd.DataFrame(\n",
    "        {\"epitope\": [\"class 2\", \"class 3\"], \"non_neutralized_frac\": [0.01, 0.05]},\n",
    "    ),\n",
    ")\n",
    "\n",
    "model_2ep.curve_specs_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now compute probability of escape for 2-mutation variants, adding some noise.\n",
    "Also compute the true IC90 from the model.\n",
    "This is the simulated library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:52:30.926062Z",
     "iopub.status.busy": "2023-02-20T14:52:30.925802Z",
     "iopub.status.idle": "2023-02-20T14:52:38.948339Z",
     "shell.execute_reply": "2023-02-20T14:52:38.947660Z",
     "shell.execute_reply.started": "2023-02-20T14:52:30.926046Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average escape at each concentration:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>prob_escape</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>concentration</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0.25</th>\n",
       "      <td>0.070979</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1.00</th>\n",
       "      <td>0.033363</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4.00</th>\n",
       "      <td>0.026064</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               prob_escape\n",
       "concentration             \n",
       "0.25              0.070979\n",
       "1.00              0.033363\n",
       "4.00              0.026064"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "variants_2ep = (\n",
    "    model_2ep.prob_escape(\n",
    "        variants_df=variants_df.query(\"library == 'avg2muts'\"),\n",
    "        concentrations=[0.25, 1, 4],\n",
    "    )\n",
    "    .rename(columns={\"predicted_prob_escape\": \"prob_escape\"})\n",
    "    .drop(columns=[\"n_aa_substitutions\", \"library\"])\n",
    "    .assign(\n",
    "        # add Gaussian noise with standard deviation of 0.03\n",
    "        prob_escape=lambda x: (\n",
    "            x[\"prob_escape\"] + numpy.random.normal(scale=0.03, size=len(x))\n",
    "        ).clip(lower=0, upper=1),\n",
    "        # add rare sequencing errors\n",
    "        aa_substitutions=lambda x: x[\"aa_substitutions\"].map(add_subtract_mutation),\n",
    "    )\n",
    ")\n",
    "\n",
    "variants_2ep = model_2ep.icXX(variants_2ep, x=0.9, col=\"true IC90\")\n",
    "\n",
    "print(f\"Average escape at each concentration:\")\n",
    "display(variants_2ep.groupby(\"concentration\").aggregate({\"prob_escape\": \"mean\"}))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We write details on the 2-epitope model to files:\n",
    "\n",
    " - [2epitope_model_curve_specs.csv](2epitope_model_curve_specs.csv) the neutralization curve parameters\n",
    " - [2epitope_escape.csv](2epitope_escape.csv) the variants and their measured probability of escape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-02-20T14:52:38.949410Z",
     "iopub.status.busy": "2023-02-20T14:52:38.949149Z",
     "iopub.status.idle": "2023-02-20T14:52:39.492571Z",
     "shell.execute_reply": "2023-02-20T14:52:39.491984Z",
     "shell.execute_reply.started": "2023-02-20T14:52:38.949394Z"
    }
   },
   "outputs": [],
   "source": [
    "model_2ep.curve_specs_df.to_csv(\"2epitope_model_curve_specs.csv\", index=False)\n",
    "\n",
    "variants_2ep.to_csv(\"2epitope_escape.csv\", index=False, float_format=\"%.4g\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}