{ "cells": [ { "cell_type": "markdown", "id": "58388f02", "metadata": {}, "source": [ "# Reference site numbering\n", "So far all the preceding examples have involved analysis of proteins with sequential integer site numbers (e.g., 1, 2, 3, ...).\n", "This sequential integer numbering can also start at sites other than one (e.g., 331, 332, 333, ...).\n", "\n", "However, often proteins are numbered in relation to alignment to sequential integer numbering of some reference homolog.\n", "For instance, SARS-CoV-2 spike sequences are usually numbered in relation to the Wuhan-Hu-1 reference.\n", "But if the protein has accumulated insertions or deletions relative to the reference, there may be missing sites (gaps) or insertions (typically numbered as `214`, `214a`, `214b`, etc).\n", "\n", "This notebook shows how to perform an analysis and analyze the results with non-sequential reference site numbering.\n", "The advantage of doing this is that the results can directly be visualized/analyzed using the conventional site numbering scheme.\n", "\n", "It does this by analyzing deep mutational scanning of the SARS-CoV-2 Omicron BA.1 spike, which has several indels relative to the Wuhan-Hu-1 numbering reference." ] }, { "cell_type": "markdown", "id": "875fb573", "metadata": {}, "source": [ "## Fit and visualize a model with non-integer site numbering\n", "\n", "First, import Python modules:" ] }, { "cell_type": "code", "execution_count": 1, "id": "145073d3", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:36:24.809138Z", "iopub.status.busy": "2023-10-17T03:36:24.808603Z", "iopub.status.idle": "2023-10-17T03:36:29.786185Z", "shell.execute_reply": "2023-10-17T03:36:29.784962Z", "shell.execute_reply.started": "2023-10-17T03:36:24.809085Z" }, "tags": [] }, "outputs": [], "source": [ "import polyclonal\n", "\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "a3dd922a", "metadata": {}, "source": [ "Now read the data to fit:" ] }, { "cell_type": "code", "execution_count": 2, "id": "47a61b43", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:36:29.790275Z", "iopub.status.busy": "2023-10-17T03:36:29.790036Z", "iopub.status.idle": "2023-10-17T03:36:30.677568Z", "shell.execute_reply": "2023-10-17T03:36:30.676654Z", "shell.execute_reply.started": "2023-10-17T03:36:29.790252Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
antibody_concentrationbarcodeaa_substitutions_sequentialaa_substitutions_referencen_aa_substitutionsprob_escapeprob_escape_uncensoredno-antibody_count
00.654AAGAGCTTACTCTCGAS443P S1167PS446P S1170P20.90040.90042890
10.654ATAAGATAGATTTAGGK441T A519SK444T A522S20.80060.80062217
20.654GATCGAGTGTGTAGCAF152T K441EF157T K444E20.57970.57972966
30.654CCAAACGGTATGATGAQ14H V67P D212H N445T I1224MQ14H V67P D215H N448T I1227M50.41250.41254131
40.654TTACTGTGCAACCCAAF163L N445SF168L N448S20.36000.36004275
\n", "
" ], "text/plain": [ " antibody_concentration barcode aa_substitutions_sequential \\\n", "0 0.654 AAGAGCTTACTCTCGA S443P S1167P \n", "1 0.654 ATAAGATAGATTTAGG K441T A519S \n", "2 0.654 GATCGAGTGTGTAGCA F152T K441E \n", "3 0.654 CCAAACGGTATGATGA Q14H V67P D212H N445T I1224M \n", "4 0.654 TTACTGTGCAACCCAA F163L N445S \n", "\n", " aa_substitutions_reference n_aa_substitutions prob_escape \\\n", "0 S446P S1170P 2 0.9004 \n", "1 K444T A522S 2 0.8006 \n", "2 F157T K444E 2 0.5797 \n", "3 Q14H V67P D215H N448T I1227M 5 0.4125 \n", "4 F168L N448S 2 0.3600 \n", "\n", " prob_escape_uncensored no-antibody_count \n", "0 0.9004 2890 \n", "1 0.8006 2217 \n", "2 0.5797 2966 \n", "3 0.4125 4131 \n", "4 0.3600 4275 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read data w `na_filter=None` so empty aa_substitutions read as such rather than NA\n", "data_to_fit = pd.read_csv(\n", " \"Lib-2_2022-06-22_thaw-1_LyCoV-1404_1_prob_escape.csv\",\n", " na_filter=None,\n", ")\n", "\n", "data_to_fit.head()" ] }, { "cell_type": "markdown", "id": "136e4e11", "metadata": {}, "source": [ "Notice how these data have sites numbered in two different schemes:\n", "\n", " 1. `aa_substitutions_sequential`: sequential integer numbering (1, 2, 3, ...) of the protein used in the experiment. You could just analyze the mutations this way, but then the results will not be in standard reference numbering scheme.\n", " \n", " 2. `aa_substitutions_reference`: the referencey based site numbering, which skips sites with indels and has some non-numeric sites where there are insertions (eg, `214a`), as for example in the variant shown below:" ] }, { "cell_type": "code", "execution_count": 3, "id": "7286520e", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:36:30.681350Z", "iopub.status.busy": "2023-10-17T03:36:30.681152Z", "iopub.status.idle": "2023-10-17T03:36:30.797991Z", "shell.execute_reply": "2023-10-17T03:36:30.797181Z", "shell.execute_reply.started": "2023-10-17T03:36:30.681330Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
antibody_concentrationbarcodeaa_substitutions_sequentialaa_substitutions_referencen_aa_substitutionsprob_escapeprob_escape_uncensoredno-antibody_count
320.654CTAGATAAATCCCTGCE209A K441NE214aA K444N20.39650.39652214
\n", "
" ], "text/plain": [ " antibody_concentration barcode aa_substitutions_sequential \\\n", "32 0.654 CTAGATAAATCCCTGC E209A K441N \n", "\n", " aa_substitutions_reference n_aa_substitutions prob_escape \\\n", "32 E214aA K444N 2 0.3965 \n", "\n", " prob_escape_uncensored no-antibody_count \n", "32 0.3965 2214 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_to_fit.query(\"aa_substitutions_reference.str.contains('a')\").head(n=1)" ] }, { "cell_type": "markdown", "id": "6fcd4cc2", "metadata": {}, "source": [ "Here we will use the reference based amino-acid substitutions, so create a column with the name used by `Polyclonal` (the \"aa_substitutions\" column) that uses this numbering scheme:" ] }, { "cell_type": "code", "execution_count": 4, "id": "8904ba0f", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:36:30.803486Z", "iopub.status.busy": "2023-10-17T03:36:30.803214Z", "iopub.status.idle": "2023-10-17T03:36:30.810609Z", "shell.execute_reply": "2023-10-17T03:36:30.809836Z", "shell.execute_reply.started": "2023-10-17T03:36:30.803465Z" }, "tags": [] }, "outputs": [], "source": [ "data_to_fit[\"aa_substitutions\"] = data_to_fit[\"aa_substitutions_reference\"]" ] }, { "cell_type": "markdown", "id": "d0c01ccb", "metadata": {}, "source": [ "Importantly, in order to use reference based numbering that is not sequential integer, you also have to provide a list of the sites in order.\n", "The reason is that otherwise it's not possible for `Polyclonal` to figure out for instance if sites are just missing from the data are are actually deletions.\n", "\n", "Here we read a data frame that maps sequential to reference site numbering, and then use that to extract the list of reference sites.\n", "Note this plot also contains the protein regions, which we will use later:" ] }, { "cell_type": "code", "execution_count": 5, "id": "32a5e4ac", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:36:30.813724Z", "iopub.status.busy": "2023-10-17T03:36:30.813477Z", "iopub.status.idle": "2023-10-17T03:36:30.866805Z", "shell.execute_reply": "2023-10-17T03:36:30.866030Z", "shell.execute_reply.started": "2023-10-17T03:36:30.813703Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sequential_sitereference_siteregion
00-1other
111other
222other
333other
444other
\n", "
" ], "text/plain": [ " sequential_site reference_site region\n", "0 0 -1 other\n", "1 1 1 other\n", "2 2 2 other\n", "3 3 3 other\n", "4 4 4 other" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Note how some reference sites differ from sequential ones due to indels:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sequential_sitereference_siteregion
00-1other
696971NTD
707072NTD
717173NTD
727274NTD
\n", "
" ], "text/plain": [ " sequential_site reference_site region\n", "0 0 -1 other\n", "69 69 71 NTD\n", "70 70 72 NTD\n", "71 71 73 NTD\n", "72 72 74 NTD" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "site_numbering_map = pd.read_csv(\"BA.1_site_numbering_map.csv\")\n", "display(site_numbering_map.head())\n", "\n", "print(\"Note how some reference sites differ from sequential ones due to indels:\")\n", "display(\n", " site_numbering_map[\n", " site_numbering_map[\"sequential_site\"].astype(str)\n", " != site_numbering_map[\"reference_site\"]\n", " ].head()\n", ")\n", "\n", "sites = site_numbering_map[\"reference_site\"].tolist()" ] }, { "cell_type": "markdown", "id": "5d86c293", "metadata": {}, "source": [ "Now initialize and fit the `Polyclonal` model, but pass the `sites` argument so we can use these non-sequential-integer reference sites.\n", "(If you don't pass the `sites` argument, sites are assumed to be sequential integer):" ] }, { "cell_type": "code", "execution_count": 6, "id": "b97248e4", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:36:30.869878Z", "iopub.status.busy": "2023-10-17T03:36:30.869709Z", "iopub.status.idle": "2023-10-17T03:36:34.113712Z", "shell.execute_reply": "2023-10-17T03:36:34.113120Z", "shell.execute_reply.started": "2023-10-17T03:36:30.869860Z" }, "tags": [] }, "outputs": [], "source": [ "model = polyclonal.Polyclonal(\n", " # `polyclonal` expects the concentration column to be named \"concentration\"\n", " data_to_fit=data_to_fit.rename(columns={\"antibody_concentration\": \"concentration\"}),\n", " n_epitopes=1,\n", " alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,\n", " sites=sites,\n", ")" ] }, { "cell_type": "markdown", "id": "a91998d4", "metadata": {}, "source": [ "Note how the model has its `sequential_integer_sites` attribute set to `False`:" ] }, { "cell_type": "code", "execution_count": 7, "id": "c23db382", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:36:34.116117Z", "iopub.status.busy": "2023-10-17T03:36:34.115984Z", "iopub.status.idle": "2023-10-17T03:36:34.118891Z", "shell.execute_reply": "2023-10-17T03:36:34.118472Z", "shell.execute_reply.started": "2023-10-17T03:36:34.116103Z" }, "tags": [] }, "outputs": [], "source": [ "assert set(model.sites) == set(sites)\n", "\n", "assert model.sequential_integer_sites is False" ] }, { "cell_type": "markdown", "id": "c4cbbc49", "metadata": {}, "source": [ "Now fit the model:" ] }, { "cell_type": "code", "execution_count": 8, "id": "66d7cadc", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:36:34.121300Z", "iopub.status.busy": "2023-10-17T03:36:34.121167Z", "iopub.status.idle": "2023-10-17T03:38:47.827067Z", "shell.execute_reply": "2023-10-17T03:38:47.825151Z", "shell.execute_reply.started": "2023-10-17T03:36:34.121287Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", "# Fitting site-level model.\n", "# Starting optimization of 1248 parameters at Mon Oct 16 20:36:36 2023.\n", " step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac\n", " 0 0.075336 37318 37314 0 0 0 0 0 3.7281 0 0\n", " 177 16.341 4871.2 4822.7 39.285 0 0 0 0 9.2641 0 0\n", "# Successfully finished at Mon Oct 16 20:36:53 2023.\n", "#\n", "# Fitting model.\n", "# Starting optimization of 8450 parameters at Mon Oct 16 20:36:53 2023.\n", " step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac\n", " 0 3.5293 7305.4 7031.5 264.69 4.1727e-31 0 0 0 9.2641 0 0\n", " 200 106.55 6812.9 6701.1 89.496 11.721 0 0 0 10.565 0 0\n", " 257 114.37 6812.4 6700.3 89.672 11.869 0 0 0 10.564 0 0\n", "# Successfully finished at Mon Oct 16 20:38:47 2023.\n" ] } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "_ = model.fit(\n", " logfreq=200,\n", " reg_escape_weight=0.1,\n", " fix_hill_coefficient=True,\n", " fix_non_neutralized_frac=True,\n", ")" ] }, { "cell_type": "markdown", "id": "58655e40", "metadata": {}, "source": [ "Look at output.\n", "First we look at the `mut_escape_df`.\n", "The site entries are str:" ] }, { "cell_type": "code", "execution_count": 9, "id": "fb2b6a7e", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:38:47.833912Z", "iopub.status.busy": "2023-10-17T03:38:47.833541Z", "iopub.status.idle": "2023-10-17T03:38:48.014946Z", "shell.execute_reply": "2023-10-17T03:38:48.014257Z", "shell.execute_reply.started": "2023-10-17T03:38:47.833876Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
epitopesitewildtypemutantmutationescapetimes_seen
011MIM1I-0.001
111MTM1T-0.012
212FIF2I0.002
312FLF2L0.0114
412FSF2S-0.0114
\n", "
" ], "text/plain": [ " epitope site wildtype mutant mutation escape times_seen\n", "0 1 1 M I M1I -0.00 1\n", "1 1 1 M T M1T -0.01 2\n", "2 1 2 F I F2I 0.00 2\n", "3 1 2 F L F2L 0.01 14\n", "4 1 2 F S F2S -0.01 14" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "assert all(model.mut_escape_df[\"site\"].astype(str) == model.mut_escape_df[\"site\"])\n", "assert set(model.mut_escape_df[\"site\"]).issubset(sites)\n", "\n", "model.mut_escape_df.head().round(2)" ] }, { "cell_type": "markdown", "id": "098e2311", "metadata": {}, "source": [ "The same is true for `mut_escape_site_summary_df`:" ] }, { "cell_type": "code", "execution_count": 10, "id": "e66e8ec3", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:38:48.017766Z", "iopub.status.busy": "2023-10-17T03:38:48.017589Z", "iopub.status.idle": "2023-10-17T03:38:48.250680Z", "shell.execute_reply": "2023-10-17T03:38:48.249962Z", "shell.execute_reply.started": "2023-10-17T03:38:48.017749Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
epitopesitewildtypemeantotal positivemaxmintotal negativen mutations
011M-0.000.00-0.00-0.01-0.012
112F0.010.030.02-0.01-0.014
213V0.060.650.59-0.17-0.218
314F0.221.141.07-0.02-0.045
415L0.002.161.09-1.02-2.1220
\n", "
" ], "text/plain": [ " epitope site wildtype mean total positive max min total negative \\\n", "0 1 1 M -0.00 0.00 -0.00 -0.01 -0.01 \n", "1 1 2 F 0.01 0.03 0.02 -0.01 -0.01 \n", "2 1 3 V 0.06 0.65 0.59 -0.17 -0.21 \n", "3 1 4 F 0.22 1.14 1.07 -0.02 -0.04 \n", "4 1 5 L 0.00 2.16 1.09 -1.02 -2.12 \n", "\n", " n mutations \n", "0 2 \n", "1 4 \n", "2 8 \n", "3 5 \n", "4 20 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "assert all(\n", " model.mut_escape_site_summary_df()[\"site\"].astype(str)\n", " == model.mut_escape_site_summary_df()[\"site\"]\n", ")\n", "assert set(model.mut_escape_site_summary_df()[\"site\"]).issubset(sites)\n", "\n", "model.mut_escape_site_summary_df().head().round(2)" ] }, { "cell_type": "markdown", "id": "9a1c5ecb", "metadata": {}, "source": [ "Now plot the escape values.\n", "Note how we merge in the data frame with sequential site numbers and protein regions and use it to color the zoom bar and add sequential sites to the lineplot and heatmap tooltips:" ] }, { "cell_type": "code", "execution_count": 11, "id": "48d49cfb", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:38:48.254208Z", "iopub.status.busy": "2023-10-17T03:38:48.253925Z", "iopub.status.idle": "2023-10-17T03:38:50.154569Z", "shell.execute_reply": "2023-10-17T03:38:50.153535Z", "shell.execute_reply.started": "2023-10-17T03:38:48.254191Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.VConcatChart(...)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "model.mut_escape_plot(\n", " df_to_merge=site_numbering_map.rename(columns={\"reference_site\": \"site\"}),\n", " addtl_tooltip_stats=[\"sequential_site\"],\n", " site_zoom_bar_color_col=\"region\",\n", " addtl_slider_stats={\"times_seen\": 2},\n", ")" ] }, { "cell_type": "markdown", "id": "7baf05dc", "metadata": {}, "source": [ "## Confirm same results for reference and sequential integer number\n", "To demonstrate how the results are the same regardless of which numbering scheme is used for the fitting, we also fit a model with the sequentially numbered variants.\n", "This section of the notebook can almost be considered a test rather than an example." ] }, { "cell_type": "code", "execution_count": 12, "id": "d27d81af", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:38:50.158406Z", "iopub.status.busy": "2023-10-17T03:38:50.158087Z", "iopub.status.idle": "2023-10-17T03:39:52.039009Z", "shell.execute_reply": "2023-10-17T03:39:52.037657Z", "shell.execute_reply.started": "2023-10-17T03:38:50.158382Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", "# Fitting site-level model.\n", "# Starting optimization of 1248 parameters at Mon Oct 16 20:38:56 2023.\n", " step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac\n", " 0 0.076008 37318 37314 0 0 0 0 0 3.7281 0 0\n", " 191 16.705 4871.2 4822.4 39.515 0 0 0 0 9.2632 0 0\n", "# Successfully finished at Mon Oct 16 20:39:12 2023.\n", "#\n", "# Fitting model.\n", "# Starting optimization of 8450 parameters at Mon Oct 16 20:39:13 2023.\n", " step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac\n", " 0 0.11586 7306.6 7031.5 265.85 2.1605e-31 0 0 0 9.2632 0 0\n", " 200 27.761 6813 6701.1 89.589 11.718 0 0 0 10.559 0 0\n", " 282 38.933 6812.4 6700.3 89.705 11.873 0 0 0 10.564 0 0\n", "# Successfully finished at Mon Oct 16 20:39:52 2023.\n" ] } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "model_sequential = polyclonal.Polyclonal(\n", " data_to_fit=(\n", " data_to_fit.drop(columns=\"aa_substitutions\").rename(\n", " columns={\n", " \"antibody_concentration\": \"concentration\",\n", " \"aa_substitutions_sequential\": \"aa_substitutions\",\n", " }\n", " )\n", " ),\n", " n_epitopes=1,\n", " alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,\n", ")\n", "\n", "assert model_sequential.sequential_integer_sites is True\n", "\n", "_ = model_sequential.fit(\n", " logfreq=200,\n", " reg_escape_weight=0.1,\n", " fix_hill_coefficient=True,\n", " fix_non_neutralized_frac=True,\n", ")" ] }, { "cell_type": "markdown", "id": "ffe57ba3", "metadata": { "execution": { "iopub.execute_input": "2022-07-31T21:35:32.660923Z", "iopub.status.busy": "2022-07-31T21:35:32.660504Z", "iopub.status.idle": "2022-07-31T21:35:32.673648Z", "shell.execute_reply": "2022-07-31T21:35:32.672781Z", "shell.execute_reply.started": "2022-07-31T21:35:32.660880Z" }, "tags": [] }, "source": [ "Now make sure the fitting gives nearly the same result regardless of which site numbering is used.\n", "First check the activity values:" ] }, { "cell_type": "code", "execution_count": 13, "id": "c1f00fdf", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:39:52.040920Z", "iopub.status.busy": "2023-10-17T03:39:52.040547Z", "iopub.status.idle": "2023-10-17T03:39:52.059166Z", "shell.execute_reply": "2023-10-17T03:39:52.058391Z", "shell.execute_reply.started": "2023-10-17T03:39:52.040886Z" }, "tags": [] }, "outputs": [], "source": [ "pd.testing.assert_frame_equal(\n", " model_sequential.curve_specs_df,\n", " model.curve_specs_df,\n", " atol=0.02,\n", ")" ] }, { "cell_type": "markdown", "id": "c725072b", "metadata": { "execution": { "iopub.execute_input": "2022-07-31T21:36:53.745074Z", "iopub.status.busy": "2022-07-31T21:36:53.744547Z", "iopub.status.idle": "2022-07-31T21:36:53.751777Z", "shell.execute_reply": "2022-07-31T21:36:53.750914Z", "shell.execute_reply.started": "2022-07-31T21:36:53.745027Z" }, "tags": [] }, "source": [ "Now compare the mutation-escape values.\n", "In order to do this, we have to re-number the sequential values to reference numbering:" ] }, { "cell_type": "code", "execution_count": 14, "id": "44a3804f", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:39:52.060369Z", "iopub.status.busy": "2023-10-17T03:39:52.060063Z", "iopub.status.idle": "2023-10-17T03:39:52.341047Z", "shell.execute_reply": "2023-10-17T03:39:52.340427Z", "shell.execute_reply.started": "2023-10-17T03:39:52.060352Z" }, "tags": [] }, "outputs": [], "source": [ "min_times_seen = 50\n", "\n", "mut_escape = model.mut_escape_df.drop(columns=\"mutation\")\n", "\n", "mut_escape_sequential = model_sequential.mut_escape_df.assign(\n", " site=lambda x: x[\"site\"].map(\n", " site_numbering_map.set_index(\"sequential_site\")[\"reference_site\"].to_dict()\n", " )\n", ").drop(columns=\"mutation\")\n", "\n", "# have to use fairly big atol to test this, so also do correlations\n", "pd.testing.assert_frame_equal(\n", " mut_escape,\n", " mut_escape_sequential,\n", " atol=1.5,\n", ")\n", "assert 0.99 < mut_escape[\"escape\"].corr(mut_escape_sequential[\"escape\"])" ] }, { "cell_type": "markdown", "id": "a5961bd0", "metadata": {}, "source": [ "## Test of averaging\n", "Just as a toy test, average the reference-site model with itself and make sure it still displays reference-based site numbers:" ] }, { "cell_type": "code", "execution_count": 15, "id": "73015140", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:39:52.342329Z", "iopub.status.busy": "2023-10-17T03:39:52.342070Z", "iopub.status.idle": "2023-10-17T03:39:59.274291Z", "shell.execute_reply": "2023-10-17T03:39:59.273547Z", "shell.execute_reply.started": "2023-10-17T03:39:52.342312Z" }, "tags": [] }, "outputs": [], "source": [ "avg_model = polyclonal.PolyclonalAverage(\n", " pd.DataFrame({\"number\": [1, 2], \"model\": [model, model]}),\n", ")" ] }, { "cell_type": "code", "execution_count": 16, "id": "37a56627", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:39:59.276472Z", "iopub.status.busy": "2023-10-17T03:39:59.275639Z", "iopub.status.idle": "2023-10-17T03:40:02.984557Z", "shell.execute_reply": "2023-10-17T03:40:02.983695Z", "shell.execute_reply.started": "2023-10-17T03:39:59.276435Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.VConcatChart(...)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NBVAL_IGNORE_OUTPUT\n", "\n", "avg_model.mut_escape_plot()" ] }, { "cell_type": "markdown", "id": "deb1caba", "metadata": {}, "source": [ "You cannot average models that don't have the same site-numbering setup in terms of sequential or non-sequential:" ] }, { "cell_type": "code", "execution_count": 17, "id": "5f35ee94", "metadata": { "execution": { "iopub.execute_input": "2023-10-17T03:40:02.985774Z", "iopub.status.busy": "2023-10-17T03:40:02.985482Z", "iopub.status.idle": "2023-10-17T03:40:08.031096Z", "shell.execute_reply": "2023-10-17T03:40:08.030250Z", "shell.execute_reply.started": "2023-10-17T03:40:02.985756Z" }, "tags": [] }, "outputs": [ { "ename": "PolyclonalHarmonizeError", "evalue": "Models being harmonized don't have same `sequential_integer_sites`", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mPolyclonalHarmonizeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[17], line 4\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# NBVAL_RAISES_EXCEPTION\u001b[39;00m\n\u001b[1;32m 2\u001b[0m \n\u001b[1;32m 3\u001b[0m \u001b[38;5;66;03m# this will give an error\u001b[39;00m\n\u001b[0;32m----> 4\u001b[0m \u001b[43mpolyclonal\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mPolyclonalAverage\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 5\u001b[0m \u001b[43m \u001b[49m\u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mDataFrame\u001b[49m\u001b[43m(\u001b[49m\u001b[43m{\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mnumber\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mmodel\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[43mmodel_sequential\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m]\u001b[49m\u001b[43m}\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 6\u001b[0m \u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/polyclonal/polyclonal/polyclonal_collection.py:1459\u001b[0m, in \u001b[0;36mPolyclonalAverage.__init__\u001b[0;34m(self, models_df, harmonize_to, default_avg_to_plot)\u001b[0m\n\u001b[1;32m 1456\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m harmonize_to \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 1457\u001b[0m harmonize_to \u001b[38;5;241m=\u001b[39m models_df\u001b[38;5;241m.\u001b[39miloc[\u001b[38;5;241m0\u001b[39m][\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmodel\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[0;32m-> 1459\u001b[0m models_df[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmodel\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[43m[\u001b[49m\n\u001b[1;32m 1460\u001b[0m \u001b[43m \u001b[49m\u001b[43mm\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mepitope_harmonized_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mharmonize_to\u001b[49m\u001b[43m)\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mm\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mmodels_df\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mmodel\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\n\u001b[1;32m 1461\u001b[0m \u001b[43m\u001b[49m\u001b[43m]\u001b[49m\n\u001b[1;32m 1463\u001b[0m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39m\u001b[38;5;21m__init__\u001b[39m(models_df, default_avg_to_plot\u001b[38;5;241m=\u001b[39mdefault_avg_to_plot)\n", "File \u001b[0;32m~/polyclonal/polyclonal/polyclonal_collection.py:1460\u001b[0m, in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 1456\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m harmonize_to \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 1457\u001b[0m harmonize_to \u001b[38;5;241m=\u001b[39m models_df\u001b[38;5;241m.\u001b[39miloc[\u001b[38;5;241m0\u001b[39m][\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmodel\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[1;32m 1459\u001b[0m models_df[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmodel\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m [\n\u001b[0;32m-> 1460\u001b[0m \u001b[43mm\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mepitope_harmonized_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mharmonize_to\u001b[49m\u001b[43m)\u001b[49m[\u001b[38;5;241m0\u001b[39m] \u001b[38;5;28;01mfor\u001b[39;00m m \u001b[38;5;129;01min\u001b[39;00m models_df[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmodel\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[1;32m 1461\u001b[0m ]\n\u001b[1;32m 1463\u001b[0m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39m\u001b[38;5;21m__init__\u001b[39m(models_df, default_avg_to_plot\u001b[38;5;241m=\u001b[39mdefault_avg_to_plot)\n", "File \u001b[0;32m~/polyclonal/polyclonal/polyclonal.py:3233\u001b[0m, in \u001b[0;36mPolyclonal.epitope_harmonized_model\u001b[0;34m(self, ref_poly)\u001b[0m\n\u001b[1;32m 3227\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m PolyclonalHarmonizeError(\n\u001b[1;32m 3228\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mThe models being harmonized have different epitope labels:\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 3229\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mepitopes\u001b[38;5;132;01m=}\u001b[39;00m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;132;01m{\u001b[39;00mref_poly\u001b[38;5;241m.\u001b[39mepitopes\u001b[38;5;132;01m=}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 3230\u001b[0m )\n\u001b[1;32m 3232\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39msequential_integer_sites \u001b[38;5;241m!=\u001b[39m ref_poly\u001b[38;5;241m.\u001b[39msequential_integer_sites:\n\u001b[0;32m-> 3233\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m PolyclonalHarmonizeError(\n\u001b[1;32m 3234\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mModels being harmonized don\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mt have same `sequential_integer_sites`\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 3235\u001b[0m )\n\u001b[1;32m 3237\u001b[0m corr_df \u001b[38;5;241m=\u001b[39m (\n\u001b[1;32m 3238\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmut_escape_corr(ref_poly)\n\u001b[1;32m 3239\u001b[0m \u001b[38;5;241m.\u001b[39msort_values([\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mself_epitope\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcorrelation\u001b[39m\u001b[38;5;124m\"\u001b[39m], ascending\u001b[38;5;241m=\u001b[39m[\u001b[38;5;28;01mTrue\u001b[39;00m, \u001b[38;5;28;01mFalse\u001b[39;00m])\n\u001b[1;32m 3240\u001b[0m \u001b[38;5;241m.\u001b[39mreset_index(drop\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m)\n\u001b[1;32m 3241\u001b[0m )\n\u001b[1;32m 3243\u001b[0m harmonize_df \u001b[38;5;241m=\u001b[39m (\n\u001b[1;32m 3244\u001b[0m corr_df\u001b[38;5;241m.\u001b[39mrename(columns\u001b[38;5;241m=\u001b[39m{\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mself_epitope\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mself_initial_epitope\u001b[39m\u001b[38;5;124m\"\u001b[39m})\n\u001b[1;32m 3245\u001b[0m \u001b[38;5;241m.\u001b[39mgroupby(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mself_initial_epitope\u001b[39m\u001b[38;5;124m\"\u001b[39m, as_index\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 3254\u001b[0m ]\n\u001b[1;32m 3255\u001b[0m )\n", "\u001b[0;31mPolyclonalHarmonizeError\u001b[0m: Models being harmonized don't have same `sequential_integer_sites`" ] } ], "source": [ "# NBVAL_RAISES_EXCEPTION\n", "\n", "# this will give an error\n", "polyclonal.PolyclonalAverage(\n", " pd.DataFrame({\"number\": [1, 2], \"model\": [model_sequential, model]}),\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }