{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NOTEBOOK_HEADER-->\n",
    "*This notebook contains material from [PyRosetta](https://RosettaCommons.github.io/PyRosetta.notebooks);\n",
    "content is available [on Github](https://github.com/RosettaCommons/PyRosetta.notebooks.git).*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Setting up a membrane protein in the bilayer](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/15.01-Accounting-for-the-lipid-bilayer.ipynb) | [Contents](toc.ipynb) | [Index](index.ipynb) | [Running Rosetta in Parallel](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/16.00-Running-PyRosetta-in-Parallel.ipynb) ><p><a href=\"https://colab.research.google.com/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/15.02-Membrane-Protein-ddG-of-mutation.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open in Google Colaboratory\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Predicting the ∆∆G of single point mutations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accurately estimating the thermodynamic cost of a mutation is a building block of protein engineering and design. This task is especially tricky for membrane proteins because the calculations must account for the lipid bilayer. In this tutorial, we will walk through the protocol for estimating the ∆∆G for lipid facing positions using RosettaMP and the _franklin2019_ energy function. \n",
    "\n",
    "As an example, we will examine mutations in the integral membrane enzyme PagP. PagP is a beta-barrel protein that transfers a palmitoyl group fron the sn-1 position of a glycerophospholipid to the endotoxin of lipopolysacharide (LPS). The enzyme provides bacterial resistance to host immune defenses such as antimicrobial pepetides (Guo et al. 1998; Kawasaki et al. 2004). Recently, Marx & Fleming measured the energetic cost of making mutations at the V111 position on PagP (Marx & Fleming, 2017). Here, we will perform the same set of mutations with Rosetta and compare with the experimental values. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Objectives\n",
    " - Setup a protein in an implicit lipid bilayer\n",
    " - Compute the ∆∆G of mutation\n",
    " - Analyze contributions to the change in stability\n",
    " - Visualize the model in PyMOL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PyRosetta Initialization & Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first step is to initialize PyRosetta and load the protein of interest. In this tutorial, we will use PagP (PDB 3GP6). The starting structure is from the Orientations of Proteins in Membranes database (https://opm.phar.umich.edu/) which provides spatial arrangements of membrane proteins in the lipid bilayer. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install pyrosettacolabsetup\n",
    "import pyrosettacolabsetup; pyrosettacolabsetup.install_pyrosetta()\n",
    "import pyrosetta; pyrosetta.init()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyrosetta import *\n",
    "init( extra_options=\"-mp:lipids:has_pore false\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make sure you are in the right directory for accessing the `.pdb` files:\n",
    "\n",
    "`cd google_drive/MyDrive/student-notebooks/`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#cd google_drive/MyDrive/student-notebooks/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pose = pose_from_pdb( \"inputs/3gp6_A.pdb\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, initialize the protein in the membrane using `AddMembraneMover.` Here, the protein is already oriented in the bilayer so we can estimate the transmembrane spans from the structure and orientation. Thus, we use the `from_structure` option to initialize the spanning topology. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyrosetta.rosetta.protocols.membrane import *\n",
    "add_memb = AddMembraneMover(\"from_structure\")\n",
    "add_memb.apply(pose)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ∆∆G of mutation calculations\n",
    "\n",
    "Next, we will compute the ∆∆G for several point mutations in PagP. In the Marx & Fleming experiment, position V111 was first mutated to alanine. Therefore, we will create this variant first. We will use the `mutate_residue` function from the `predict_ddG` PyRosetta module included in this package. In this tutorial, we will use a repack radius of 8.0 Å. \n",
    "\n",
    "An important note - Pyrosetta residue numbering may differ from the PDB numbering because PyRosetta requires continuous numbering for calculations. Here, the PyRosetta residue number for V111 is 104. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from additional_scripts import predict_ddG\n",
    "# Create a franklin2019 energy function\n",
    "sfxn = create_score_function(\"franklin2019\")\n",
    "# Repack and score the native conformation\n",
    "reference_pose = predict_ddG.mutate_residue(pose, 104, \"A\", 8.0, sfxn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To demonstrate the ΔΔG calculation, we will now compute the energetic cost\n",
    "of mutating alanine to tryptophan."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Score the alanine reference pose\n",
    "score_A111 = sfxn.score(reference_pose)\n",
    "# Repack and score the L111 conformation\n",
    "pose_W111 = predict_ddG.mutate_residue(pose, 104, \"W\", 8.0, sfxn)\n",
    "score_W111 = sfxn.score(pose_W111)\n",
    "# Compute the ddG of mutation as mutant_score - native_score (final-initial\n",
    "ddG = score_W111 - score_A111\n",
    "print(ddG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ΔΔG for mutating alanine to tryptophan at position 111 is -1.84 Rosetta Energy Units (REU). A\n",
    "Rosetta Energy Unit is an arbitrary unit for the Rosetta energy function. Next, we would like to\n",
    "compute the ΔΔG for mutating alanine to all 19 canonical amino acids. To do so, we will generalize\n",
    "the code above into a function for easy calculations of multiple single point mutations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_ddG(pose, native_res, site_no, mutant_res, sfxn): \n",
    "    \"\"\"A function for computing the ddG of single point mutations\n",
    "    \n",
    "    Example: \n",
    "        $ compute_ddG(pose, \"V\", 49, \"A\", sfxn)\n",
    "        \n",
    "    Arguments: \n",
    "        - pose = Object containing the coordinates for the biomolecular system\n",
    "        - native_res = Native amino acid\n",
    "        - site_no = Host site amino acid position\n",
    "        - mutant_res = Mutant amino acid\n",
    "        = sfxn = Score function object\n",
    "    \"\"\"\n",
    "    \n",
    "    repacked_native = predict_ddG.mutate_residue(pose, site_no, native_res, 8.0, sfxn)\n",
    "    native_score = sfxn.score(repacked_native)\n",
    "    repacked_mutant = predict_ddG.mutate_residue(pose, site_no, mutant_res, 8.0, sfxn)\n",
    "    mutant_score = sfxn.score(repacked_mutant)\n",
    "    ddG = mutant_score - native_score\n",
    "    return ddG"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we will write a loop that computes the ΔΔG for all canonical amino acids and store the results\n",
    "in a python dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List of canonical amino acid one-letter codes\n",
    "amino_acids = [ 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y' ]\n",
    "# Initialize an empty dictionary to store the data\n",
    "ddG_data = {}\n",
    "# Loop through all amino acids\n",
    "for aa in amino_acids:\n",
    "    ddG = compute_ddG(reference_pose, 'A', 104, aa, sfxn)\n",
    "    ddG_data[ aa ] = ddG"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step is to compare the ΔΔG predictions to the experimentally measured values. The experimental data for Marx & Fleming are located in `inputs` in a file called `PagP_Marx_Fleming_set.dat`. We will parse the file and then import these values into a |dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read contents of file into a list (file format is space delimited)\n",
    "with open( 'inputs/PagP_Marx_Fleming_set.dat', 'rt' ) as f:\n",
    "    data = f.readlines()\n",
    "    data = [ x.strip() for x in data ]\n",
    "    data = [ x.split(' ') for x in data ]\n",
    "\n",
    "# Convert the list into a dictionary\n",
    "exp_ddG_data = {}\n",
    "for i in range(1, len(data)):\n",
    "    exp_ddG_data[ data[i][2] ] = float(data[i][3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now convert the dictionary format to numpy arrays that are compatible with analysis. Here, we\n",
    "will compute the correlation coefficient and make a scatterplot of the experimentally measured vs.\n",
    "predicted values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "mutations = np.asarray( ddG_data.keys() )\n",
    "ddG_values = np.asarray( list(ddG_data.values()) )\n",
    "exp_values = np.asarray( list(exp_ddG_data.values()) )\n",
    "\n",
    "# Compute the correlation coefficient\n",
    "corr = np.corrcoef( exp_values, ddG_values )\n",
    "print(corr[0,1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We initially find that the correlation coefficient is low (0.376). We will want to find any outliers in the dataset that are lowering this value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_outliers(x):\n",
    "    upper_quartile = np.percentile(x, 75)\n",
    "    lower_quartile = np.percentile(x, 25)\n",
    "    IQR = (upper_quartile - lower_quartile)\n",
    "    quartile_set = (lower_quartile - IQR, upper_quartile + IQR)\n",
    "    for y in x.tolist():\n",
    "        if (y < quartile_set[0]) or (y > quartile_set[1]):\n",
    "            print(y)\n",
    "find_outliers(ddG_values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, the ΔΔG value for proline is an outlier. We will investigate this more later. For now, we will remove it from the set and then recompute the correlation coefficient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Proline is the 13th, amino acid of 20\n",
    "exp_data_no_P = []\n",
    "pred_data_no_P = []\n",
    "for i in range(0, 20):\n",
    "    if ( i != 12 ):\n",
    "        exp_data_no_P.append( list(exp_ddG_data.values())[i] )\n",
    "        pred_data_no_P.append( list(ddG_data.values())[i] )\n",
    "\n",
    "exp_data_no_P = np.asarray( exp_data_no_P )\n",
    "pred_data_no_P = np.asarray( pred_data_no_P )\n",
    "corr = np.corrcoef( exp_data_no_P, pred_data_no_P )\n",
    "print(corr[0,1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The new value of R = 0.692 is much more encouraging! Next, we will visualized the predicted vs. experimentally measured values with a scatterplot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "plt.axhline(y=0, color='k', linestyle='-')\n",
    "plt.axvline(x=0, color='k', linestyle=\"-\")\n",
    "plt.ylim([-6,6])\n",
    "plt.xlim([-6,6])\n",
    "\n",
    "# compute the best fit line\n",
    "from numpy.polynomial.polynomial import polyfit\n",
    "b, m = polyfit(exp_data_no_P, pred_data_no_P, 1)\n",
    "x = np.linspace(-6, 6, num=50)\n",
    "plt.plot(x, b + m * x, color='r', linestyle='-')\n",
    "\n",
    "# plot the data\n",
    "plt.scatter(exp_data_no_P, pred_data_no_P)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we would like to use the models to learn why some mutations stabilized PagP, whereas other side chains did not. Of course, we need a metric for identifying the most confident predictions, especially since the correlation coefficient is not perfect. To do so, we will compute the residuals from the line of best fit and set an empirical cutoff of 1.5 REU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "sns.set(style=\"whitegrid\")\n",
    "resid = sns.residplot(x=exp_data_no_P, y=pred_data_no_P, color=\"b\")\n",
    "resid.set_ylabel(\"Residual\")\n",
    "resid.set_xlabel(\"Exp (kcal/mol)\")\n",
    "print(exp_data_no_P)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we find five ∆∆G values that are predicted outside of the cutoff: Glycine, Leucine, Valine, Tryptophan, and Threonine. Next, we will use this information to hypothesize a mechanism for a reasonable prediction (lysine) and rationalize incorrect predictions for proline and leucine. The first step is to quantify which energy components make the largest contribution to the overall ∆∆G of mutation. To do so, we will write a function that can extrapolate this information from the energy function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Store models of mutated PagP proteins\n",
    "mutant_tyr = predict_ddG.mutate_residue(pose, 104, \"Y\", 8.0, sfxn)\n",
    "mutant_lys = predict_ddG.mutate_residue(pose, 104, \"K\", 8.0, sfxn)\n",
    "mutant_leu = predict_ddG.mutate_residue(pose, 104, \"L\", 8.0, sfxn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_energy_components( native_pose, mutated_pose, sfxn): \n",
    "\n",
    "    # Extract & parse scores\n",
    "    tmp_native = native_pose.energies().total_energies().weighted_string_of( sfxn.weights() )\n",
    "    tmp_mutant = mutated_pose.energies().total_energies().weighted_string_of( sfxn.weights() )\n",
    "    array_native = list(filter( None, tmp_native.split(' ') ))\n",
    "    array_mutant = list(filter( None, tmp_mutant.split(' ') ))\n",
    "\n",
    "    # Pull out only the scores from these arrays\n",
    "    native_scores = []\n",
    "    for i in range( len(array_native) ): \n",
    "        if ( i % 2 != 0 ): \n",
    "            native_scores.append( float( array_native[i] ) )\n",
    "\n",
    "    mutant_scores = []\n",
    "    for i in range( len(array_mutant) ): \n",
    "        if ( i % 2 != 0 ): \n",
    "            mutant_scores.append( float( array_mutant[i] ) )\n",
    "\n",
    "    # Calculate ddG of individual components\n",
    "    ddGs = []\n",
    "    for i in range( len( mutant_scores ) ): \n",
    "        ddG_component = mutant_scores[i] - native_scores[i]\n",
    "        ddGs.append( round( ddG_component, 3 ) )\n",
    "\n",
    "    # Get labels\n",
    "    labels = []\n",
    "    for i in range( len(array_native) ): \n",
    "        if ( i % 2 == 0 ): \n",
    "            labels.append( array_native[i].translate(':').strip(\":\") )\n",
    "\n",
    "    return labels, ddGs\n",
    "\n",
    "# Compute the ddG breakdown\n",
    "labels, tyr_ddGs = get_energy_components( reference_pose, mutant_tyr, sfxn )\n",
    "labels, lys_ddGs = get_energy_components( reference_pose, mutant_lys, sfxn )\n",
    "labels, leu_ddGs = get_energy_components( reference_pose, mutant_leu, sfxn )\n",
    "print(labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will make a bar graph to visualize the contributions of each energy component for the ddG of these three single point mutations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.arange(len(labels))  \n",
    "width = 1  # the width of the bars\n",
    "\n",
    "# Plot for tyrosine\n",
    "fig, ax = plt.subplots()\n",
    "rects1 = ax.bar(x, tyr_ddGs, width, label='Y')\n",
    "ax.set_ylabel('∆∆G (REU)')\n",
    "ax.set_title('Contributions to the ∆∆G of Mutation for A104Y')\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels(labels, rotation='vertical')\n",
    "ax.legend()\n",
    "fig.tight_layout()\n",
    "\n",
    "# Plot for lysine\n",
    "fig, ax = plt.subplots()\n",
    "rects2 = ax.bar(x, lys_ddGs, width, label='K')\n",
    "ax.set_ylabel('∆∆G (REU)')\n",
    "ax.set_title('Contributions to the ∆∆G of Mutation for A104K')\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels(labels, rotation='vertical')\n",
    "ax.legend()\n",
    "fig.tight_layout()\n",
    "\n",
    "# Plot for lysine\n",
    "fig, ax = plt.subplots()\n",
    "rects3 = ax.bar(x, leu_ddGs, width, label='L' )\n",
    "ax.set_ylabel('∆∆G (REU)')\n",
    "ax.set_title('Contributions to the ∆∆G of Mutation for A104L')\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels(labels, rotation='vertical')\n",
    "ax.legend()\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we will export the model files so that we can visualize them in PyMOL. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reference_pose.dump_pdb( \"PagP_ala_ref.pdb\" )\n",
    "mutant_tyr.dump_pdb( \"PagP_A104Y.pdb\" )\n",
    "mutant_lys.dump_pdb( \"PagP_A104K.pdb\" )\n",
    "mutant_leu.dump_pdb( \"PagP_A104L.pdb\" )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TODO List\n",
    " - make notes about lipid composiiton above\n",
    " - view in pymol\n",
    " - hypothesize mechanisms\n",
    " - add notes about what to do without experimental information\n",
    " - break into sections as prescribed above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Setting up a membrane protein in the bilayer](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/15.01-Accounting-for-the-lipid-bilayer.ipynb) | [Contents](toc.ipynb) | [Index](index.ipynb) | [Running Rosetta in Parallel](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/16.00-Running-PyRosetta-in-Parallel.ipynb) ><p><a href=\"https://colab.research.google.com/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/15.02-Membrane-Protein-ddG-of-mutation.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open in Google Colaboratory\"></a>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}