{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NOTEBOOK_HEADER-->\n",
    "*This notebook contains material from [PyRosetta](https://RosettaCommons.github.io/PyRosetta.notebooks);\n",
    "content is available [on Github](https://github.com/RosettaCommons/PyRosetta.notebooks.git).*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Distributed analysis example: exhaustive ddG PSSM](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/16.01-PyData-ddG-pssm.ipynb) | [Contents](toc.ipynb) | [Index](index.ipynb) | [Example of Using PyRosetta with GNU Parallel](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/16.03-GNU-Parallel-Via-Slurm.ipynb) ><p><a href=\"https://colab.research.google.com/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/16.02-PyData-miniprotein-design.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open in Google Colaboratory\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Distributed computation example: miniprotein design\n",
    "\n",
    "## Notes\n",
    "This tutorial will walk you through how to design miniproteins in PyRosetta using the PyData stack for analysis and distributed computing.\n",
    "\n",
    "This Jupyter notebook uses parallelization and is not meant to be executed within a Google Colab environment.\n",
    "\n",
    "## Setup\n",
    "Please see setup instructions in Chapter 16.00\n",
    "\n",
    "## Citation\n",
    "[Integration of the Rosetta Suite with the Python Software Stack via reproducible packaging and core programming interfaces for distributed simulation](https://doi.org/10.1002/pro.3721)\n",
    "\n",
    "Alexander S. Ford, Brian D. Weitzner, Christopher D. Bahl\n",
    "\n",
    "## Manual\n",
    "Documentation for the `pyrosetta.distributed` namespace can be found here: https://nbviewer.jupyter.org/github/proteininnovation/Rosetta-PyData_Integration/blob/master/distributed_overview.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install pyrosettacolabsetup\n",
    "import pyrosettacolabsetup; pyrosettacolabsetup.install_pyrosetta()\n",
    "import pyrosetta; pyrosetta.init()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyrosetta\n",
    "import dask.bag\n",
    "import dask.distributed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas\n",
    "import seaborn\n",
    "from matplotlib import pylab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyrosetta.distributed.tasks.score import ScorePoseTask\n",
    "from pyrosetta.distributed.io import pose_from_pdbstring\n",
    "from pyrosetta.distributed.packed_pose import to_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import zipfile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Init\n",
    "\n",
    "Setup LocalCluster to full utilize the current machine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.getenv(\"DEBUG\"):\n",
    "    cluster = dask.distributed.LocalCluster(n_workers=1, threads_per_worker=1)\n",
    "    client = dask.distributed.Client(cluster)\n",
    "else:\n",
    "    client = None\n",
    "client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load \n",
    "\n",
    "Load decoys from batch compute run, adding annotations for designed sequence, cystine count and cystine location."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_source(decoy):\n",
    "    src_pdb = zipfile.ZipFile(library).open(decoy).read()\n",
    "    p = pose_from_pdbstring(src_pdb)\n",
    "    \n",
    "    cys_locations=[i for i, c in enumerate(p.pose.sequence()) if c == \"C\"]\n",
    "    p = p.update_scores(\n",
    "        library=library,\n",
    "        decoy=decoy,\n",
    "        sequence=p.pose.sequence(),\n",
    "        num_res = len(p.pose.sequence()),\n",
    "        num_cys=len(cys_locations),\n",
    "        cys_locations=\",\".join(map(str, cys_locations))\n",
    "    )\n",
    "\n",
    "    return p"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.getenv(\"DEBUG\"):\n",
    "    library = \"inputs/EHEE_library.zip\"\n",
    "    decoy_names = [f.filename for f in zipfile.ZipFile(library).filelist if f.filename.endswith(\".pdb\")]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.getenv(\"DEBUG\"):\n",
    "    decoys = dask.bag.from_sequence(decoy_names).map(load_source).map(ScorePoseTask()).persist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.getenv(\"DEBUG\"):\n",
    "    result_frame = pandas.DataFrame.from_records(decoys.map(to_dict).compute()).sort_values(\"total_score\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analysis\n",
    "\n",
    "We anticipate a distribution of score results, with higher scores with more disulfide insertions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.getenv(\"DEBUG\"):\n",
    "    seaborn.boxplot(x=\"num_cys\", y=\"total_score\", data=result_frame)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.getenv(\"DEBUG\"):\n",
    "    seaborn.boxplot(x=\"cys_locations\", y=\"total_score\", data=result_frame)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Select \n",
    "\n",
    "Select the best model by total_Score for each inserted disulfide location, allowing us to test a variety of disulfide architectures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.getenv(\"DEBUG\"):\n",
    "    best_by_location = to_dict(result_frame.groupby(\"cys_locations\").head(1).reset_index(drop=True))\n",
    "    print(len(best_by_location))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.getenv(\"DEBUG\"):\n",
    "    with open(\"EHEE.best_by_location.fasta\", \"w\") as out:\n",
    "        for entry in best_by_location:\n",
    "            print(f\">{entry['decoy']}\", file=out)\n",
    "            print(entry['sequence'], file=out)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.getenv(\"DEBUG\"):\n",
    "    !head expected_outputs/EHEE.best_by_location.fasta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualize \n",
    "\n",
    "Visualize using Py3Dmol or PyMolMover as you have learned before. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "import py3Dmol\n",
    "view = py3Dmol.view(viewergrid=(3, 3), linked=False, width=900, height=900)\n",
    "for i in range(9):\n",
    "    view.addModel( pyrosetta.distributed.io.to_pdbstring(best_by_location[i]), \"pdb\", viewer=(i/3, i%3),)\n",
    "    \n",
    "view.setStyle({'cartoon':{'color':'spectrum'}})\n",
    "view.setStyle({\"resn\": \"CYS\"}, {'stick': {}, 'cartoon':{'color':'spectrum'}} )\n",
    "view.zoomTo()\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Distributed analysis example: exhaustive ddG PSSM](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/16.01-PyData-ddG-pssm.ipynb) | [Contents](toc.ipynb) | [Index](index.ipynb) | [Example of Using PyRosetta with GNU Parallel](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/16.03-GNU-Parallel-Via-Slurm.ipynb) ><p><a href=\"https://colab.research.google.com/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/16.02-PyData-miniprotein-design.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open in Google Colaboratory\"></a>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:PyRosetta.notebooks]",
   "language": "python",
   "name": "conda-env-PyRosetta.notebooks-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}