{ "cells": [ { "cell_type": "markdown", "id": "b9e3d531", "metadata": {}, "source": [ "# Averaging models\n", "Probably the best way to ensure robust inferences and estimate errors is to have multiple experimental replicates, ideally on different libraries.\n", "\n", "Here we describe how to average model fits across libraries and/or replicates." ] }, { "cell_type": "markdown", "id": "6e2135ec", "metadata": {}, "source": [ "## Split data into replicates\n", "We will use our data for the RBD as an earlier examples, but split it into several libraries / replicates.\n", "\n", "Specifically, we will fit two different libraries: `avg2muts` and `avg3muts`, which have different barcodes and also different mutation rates (although of course in real life you might sometimes want to average results from different libraries with the same mutation rates).\n", "We will also simulate having two replicates for each library just by sampling each library.\n", "To make this example faster, we'll just use one concentration:" ] }, { "cell_type": "code", "execution_count": 1, "id": "b242df9c", "metadata": { "execution": { "iopub.execute_input": "2023-04-11T20:19:12.879382Z", "iopub.status.busy": "2023-04-11T20:19:12.879157Z", "iopub.status.idle": "2023-04-11T20:19:15.737800Z", "shell.execute_reply": "2023-04-11T20:19:15.737059Z", "shell.execute_reply.started": "2023-04-11T20:19:12.879360Z" }, "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "import polyclonal.polyclonal\n", "import polyclonal.polyclonal_collection\n", "\n", "\n", "# read data\n", "all_data = pd.read_csv(\"RBD_variants_escape_noisy.csv\", na_filter=None)\n", "\n", "# split by library and replicates\n", "libraries = [\"avg2muts\", \"avg3muts\"] # the two libraries to use\n", "concentrations = [1] # use just use this concentration\n", "n_replicates = 2 # number of replicates per library\n", "\n", "data_by_replicate = {\n", " (library, replicate + 1): (\n", " all_data.query(\"library == @library\")\n", " .query(\"concentration in @concentrations\")\n", " .sample(frac=0.3, random_state=replicate)\n", " )\n", " for library in libraries\n", " for replicate in range(n_replicates)\n", "}" ] }, { "cell_type": "markdown", "id": "bb47c6ce", "metadata": {}, "source": [ "## Fit models to each replicate\n", "We now fit a `Polyclonal` model to each replicate using just 2 epitopes, as the data don't seem sufficient to accurately fit all three epitopes.\n", "Then we arrange the models in a data frame:" ] }, { "cell_type": "code", "execution_count": 2, "id": "1f86179e", "metadata": { "execution": { "iopub.execute_input": "2023-04-11T20:19:15.741207Z", "iopub.status.busy": "2023-04-11T20:19:15.740852Z", "iopub.status.idle": "2023-04-11T20:20:52.151015Z", "shell.execute_reply": "2023-04-11T20:20:52.149442Z", "shell.execute_reply.started": "2023-04-11T20:19:15.741183Z" }, "tags": [] }, "outputs": [], "source": [ "# first create a data frame with all the models\n", "models_by_replicate = {}\n", "for (library, replicate), data in data_by_replicate.items():\n", " model = polyclonal.Polyclonal(data_to_fit=data, n_epitopes=2)\n", " models_by_replicate[(library, replicate)] = model\n", "models_df = (\n", " pd.Series(models_by_replicate, name=\"model\")\n", " .rename_axis([\"library\", \"replicate\"])\n", " .reset_index()\n", ")\n", "\n", "# now fit the models\n", "n_fit, n_failed, models_df[\"model\"] = polyclonal.polyclonal_collection.fit_models(\n", " models_df[\"model\"],\n", " n_threads=2,\n", " reg_escape_weight=0.01,\n", " reg_uniqueness2_weight=0,\n", ")" ] }, { "cell_type": "markdown", "id": "9b91ea0e", "metadata": {}, "source": [ "Note how the models are arranged in a data frame:" ] }, { "cell_type": "code", "execution_count": 3, "id": "d6e2060b", "metadata": { "execution": { "iopub.execute_input": "2023-04-11T20:20:52.158647Z", "iopub.status.busy": "2023-04-11T20:20:52.158108Z", "iopub.status.idle": "2023-04-11T20:20:52.176663Z", "shell.execute_reply": "2023-04-11T20:20:52.176005Z", "shell.execute_reply.started": "2023-04-11T20:20:52.158597Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
libraryreplicatemodel
0avg2muts1<polyclonal.polyclonal.Polyclonal object at 0x...
1avg2muts2<polyclonal.polyclonal.Polyclonal object at 0x...
2avg3muts1<polyclonal.polyclonal.Polyclonal object at 0x...
3avg3muts2<polyclonal.polyclonal.Polyclonal object at 0x...
\n", "
" ], "text/plain": [ " library replicate model\n", "0 avg2muts 1 \n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "model_avg.mut_escape_corr_heatmap()" ] }, { "cell_type": "markdown", "id": "0cc96ce9", "metadata": {}, "source": [ "Look at the activities of the epitopes and the rest of the curves.\n", "Note how a dark line is shown for the average, and thin lines for individual replicates.\n", "It should generally be the case that the epitope with greater activity (more left shifted in plot below) should also be better correlated among replicates (heatmap above) as it can be inferred more reliably:" ] }, { "cell_type": "code", "execution_count": 6, "id": "3f07294f", "metadata": { "execution": { "iopub.execute_input": "2023-04-11T20:20:53.870946Z", "iopub.status.busy": "2023-04-11T20:20:53.870733Z", "iopub.status.idle": "2023-04-11T20:20:54.034500Z", "shell.execute_reply": "2023-04-11T20:20:54.033824Z", "shell.execute_reply.started": "2023-04-11T20:20:53.870931Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "model_avg.curves_plot()" ] }, { "cell_type": "markdown", "id": "def38e24", "metadata": {}, "source": [ "We can access the average escape values.\n", "Note that there is also a column *escape_min_magnitude* that gives the lowest magnitude value:" ] }, { "cell_type": "code", "execution_count": 7, "id": "acf519c5", "metadata": { "execution": { "iopub.execute_input": "2023-04-11T20:20:54.035543Z", "iopub.status.busy": "2023-04-11T20:20:54.035311Z", "iopub.status.idle": "2023-04-11T20:20:54.510553Z", "shell.execute_reply": "2023-04-11T20:20:54.509969Z", "shell.execute_reply.started": "2023-04-11T20:20:54.035523Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
epitopesitewildtypemutantmutationescape_meanescape_medianescape_min_magnitudeescape_stdn_modelstimes_seenfrac_models
01331NAN331A-0.066612-0.0317180.0266790.324873417.751.0
11331NDN331D-0.098052-0.100283-0.0711110.022076411.251.0
21331NEN331E-0.061359-0.008731-0.0016550.112576410.251.0
31331NFN331F0.2702430.1708690.0145520.330315410.001.0
41331NGN331G0.1862710.1649410.0313690.232272425.001.0
.......................................
38592531TRT531R-0.176471-0.112467-0.0338990.187640427.001.0
38602531TST531S-0.075136-0.0117410.0004050.155037431.751.0
38612531TVT531V-0.041719-0.0055890.0123830.094055419.501.0
38622531TWT531W0.1697180.100366-0.0041330.26661745.251.0
38632531TYT531Y-0.051762-0.050223-0.0324610.020582411.751.0
\n", "

3864 rows × 12 columns

\n", "
" ], "text/plain": [ " epitope site wildtype mutant mutation escape_mean escape_median \n", "0 1 331 N A N331A -0.066612 -0.031718 \\\n", "1 1 331 N D N331D -0.098052 -0.100283 \n", "2 1 331 N E N331E -0.061359 -0.008731 \n", "3 1 331 N F N331F 0.270243 0.170869 \n", "4 1 331 N G N331G 0.186271 0.164941 \n", "... ... ... ... ... ... ... ... \n", "3859 2 531 T R T531R -0.176471 -0.112467 \n", "3860 2 531 T S T531S -0.075136 -0.011741 \n", "3861 2 531 T V T531V -0.041719 -0.005589 \n", "3862 2 531 T W T531W 0.169718 0.100366 \n", "3863 2 531 T Y T531Y -0.051762 -0.050223 \n", "\n", " escape_min_magnitude escape_std n_models times_seen frac_models \n", "0 0.026679 0.324873 4 17.75 1.0 \n", "1 -0.071111 0.022076 4 11.25 1.0 \n", "2 -0.001655 0.112576 4 10.25 1.0 \n", "3 0.014552 0.330315 4 10.00 1.0 \n", "4 0.031369 0.232272 4 25.00 1.0 \n", "... ... ... ... ... ... \n", "3859 -0.033899 0.187640 4 27.00 1.0 \n", "3860 0.000405 0.155037 4 31.75 1.0 \n", "3861 0.012383 0.094055 4 19.50 1.0 \n", "3862 -0.004133 0.266617 4 5.25 1.0 \n", "3863 -0.032461 0.020582 4 11.75 1.0 \n", "\n", "[3864 rows x 12 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "model_avg.mut_escape_df" ] }, { "cell_type": "code", "execution_count": null, "id": "8060db93-7f29-40eb-b441-de2d32985bc6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "faf3de9c", "metadata": {}, "source": [ "Or the per-replicate escape values:" ] }, { "cell_type": "code", "execution_count": 8, "id": "b882943f", "metadata": { "execution": { "iopub.execute_input": "2023-04-11T20:20:54.511514Z", "iopub.status.busy": "2023-04-11T20:20:54.511254Z", "iopub.status.idle": "2023-04-11T20:20:54.592295Z", "shell.execute_reply": "2023-04-11T20:20:54.591799Z", "shell.execute_reply.started": "2023-04-11T20:20:54.511498Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
epitopesitewildtypemutantmutationescapetimes_seenlibraryreplicate
01331NAN331A-0.09011519avg2muts1
11331NDN331D-0.11079510avg2muts1
21331NEN331E-0.00165511avg2muts1
31331NFN331F0.03730510avg2muts1
41331NGN331G0.03136918avg2muts1
..............................
154292531TRT531R-0.03389924avg3muts2
154302531TST531S0.00040542avg3muts2
154312531TVT531V-0.02356224avg3muts2
154322531TWT531W0.5324766avg3muts2
154332531TYT531Y-0.0741429avg3muts2
\n", "

15434 rows × 9 columns

\n", "
" ], "text/plain": [ " epitope site wildtype mutant mutation escape times_seen library \n", "0 1 331 N A N331A -0.090115 19 avg2muts \\\n", "1 1 331 N D N331D -0.110795 10 avg2muts \n", "2 1 331 N E N331E -0.001655 11 avg2muts \n", "3 1 331 N F N331F 0.037305 10 avg2muts \n", "4 1 331 N G N331G 0.031369 18 avg2muts \n", "... ... ... ... ... ... ... ... ... \n", "15429 2 531 T R T531R -0.033899 24 avg3muts \n", "15430 2 531 T S T531S 0.000405 42 avg3muts \n", "15431 2 531 T V T531V -0.023562 24 avg3muts \n", "15432 2 531 T W T531W 0.532476 6 avg3muts \n", "15433 2 531 T Y T531Y -0.074142 9 avg3muts \n", "\n", " replicate \n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 \n", "... ... \n", "15429 2 \n", "15430 2 \n", "15431 2 \n", "15432 2 \n", "15433 2 \n", "\n", "[15434 rows x 9 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "model_avg.mut_escape_df_replicates" ] }, { "cell_type": "markdown", "id": "b8912133", "metadata": {}, "source": [ "Now let's plot the escape.\n", "See how you can select mutations based not only on how many times they are seen (averaged over all models in average), but also the number of models in which they are seen." ] }, { "cell_type": "code", "execution_count": 9, "id": "b085a23f", "metadata": { "execution": { "iopub.execute_input": "2023-04-11T20:20:54.593484Z", "iopub.status.busy": "2023-04-11T20:20:54.593212Z", "iopub.status.idle": "2023-04-11T20:20:57.144695Z", "shell.execute_reply": "2023-04-11T20:20:57.143640Z", "shell.execute_reply.started": "2023-04-11T20:20:54.593466Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.VConcatChart(...)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "model_avg.mut_escape_plot(addtl_slider_stats={\"times_seen\": 2})" ] }, { "cell_type": "markdown", "id": "028e40f9-8d7f-43ca-aba1-50a80f60332f", "metadata": {}, "source": [ "Here is the same plot plotting the loweset magnitude escape value across libraries for each mutation:" ] }, { "cell_type": "code", "execution_count": 10, "id": "6afa045f-4a62-4b4d-8285-4b550823a022", "metadata": { "execution": { "iopub.execute_input": "2023-04-11T20:23:44.191639Z", "iopub.status.busy": "2023-04-11T20:23:44.191154Z", "iopub.status.idle": "2023-04-11T20:23:46.768203Z", "shell.execute_reply": "2023-04-11T20:23:46.767254Z", "shell.execute_reply.started": "2023-04-11T20:23:44.191602Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.VConcatChart(...)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "model_avg.mut_escape_plot(\n", " addtl_slider_stats={\"times_seen\": 2}, avg_type=\"min_magnitude\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "b2ba1eb9-89f4-4c87-b17f-049f5bf05dba", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 5 }