{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4-Flatmapping\n",
    "This tutorial demonstrates how to split PDB structures into subcomponents or create biological assemblies. In Spark, a flatMap transformation splits each data record into zero or more records."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import pyspark and mmtfPyspark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "from mmtfPyspark.io import mmtfReader\n",
    "from mmtfPyspark.filters import ContainsDnaChain\n",
    "from mmtfPyspark.mappers import  StructureToBioassembly, StructureToPolymerChains, StructureToPolymerSequences\n",
    "from mmtfPyspark.structureViewer import view_structure\n",
    "from mmtfPyspark.utils import traverseStructureHierarchy\n",
    "import py3Dmol"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure Spark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark = SparkSession.builder.appName(\"4-Flatmapping\").getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read PDB structures\n",
    "In this example we download the hemoglobin structure 4HHB, consisting of two alpha subunits and two beta subunits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "quaternary = mmtfReader.download_reduced_mmtf_files([\"4HHB\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fb15e274d82c4e2cb3c70f87a7bb2a05",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "interactive(children=(IntSlider(value=0, continuous_update=False, description='Structure', max=0), Output()), …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "view_structure(quaternary.keys().collect());"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Flatmap by protein sequence\n",
    "Here we extract the polymer sequences using a flatMap transformation. Chains A and C (alpha subunits) and chains B and D (beta subunits) have identical sequences, respectively. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('4HHB.A',\n",
       "  'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR'),\n",
       " ('4HHB.B',\n",
       "  'VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH'),\n",
       " ('4HHB.C',\n",
       "  'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR'),\n",
       " ('4HHB.D',\n",
       "  'VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH')]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sequences = quaternary.flatMap(StructureToPolymerSequences())\n",
    "sequences.take(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Flatmap structures\n",
    "A flatMap operation splits data records into zero or more records. Here, we use the StructureToPolymerChains class to flatMap a PDB entry (quaternary structure) to its polymer chains (tertiary structure). Note, the chain Id is appended to the PDB Id. The two alpha subunit are 4HHB.A and 4HHB.C and the beta subunits are 4HHB.B and 4HHB.C."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['4HHB.A', '4HHB.B', '4HHB.C', '4HHB.D']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tertiary = quaternary.flatMap(StructureToPolymerChains())\n",
    "tertiary.keys().collect()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "374cd341366446848af48eef3ff63aa3",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "interactive(children=(IntSlider(value=0, continuous_update=False, description='Structure', max=3), Output()), …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "view_structure(tertiary.keys().collect());"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For some analyses we may only need one copy of each unique subunit (identical polymer sequence). This can be done by setting excludeDuplicates = True."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['4HHB.A', '4HHB.B']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tertiary = quaternary.flatMap(StructureToPolymerChains(excludeDuplicates=True))\n",
    "tertiary.keys().collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combine FlatMap with Filter\n",
    "The filter operations we used previously for whole structures can also be applied to single polymer chains. Here we flatMap PDB structures into polymer chains and then select select DNA chains."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = \"../resources/mmtf_reduced_sample\"\n",
    "\n",
    "dna_chains = mmtfReader \\\n",
    "    .read_sequence_file(path) \\\n",
    "    .flatMap(StructureToPolymerChains(excludeDuplicates=True)) \\\n",
    "    .filter(ContainsDnaChain())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d937ab6dbc6049eb88d3ea11273ab2a7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "interactive(children=(IntSlider(value=0, continuous_update=False, description='Structure', max=241), Output())…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "view_structure(dna_chains.keys().collect());"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## FlatMap PDB structures to Biological Assemblies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read the asymmetric unit\n",
    "In this example we read the asymmetric unit of 1STP (Complex of Biotin with Streptavidin)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "asymmetric_unit = mmtfReader.download_full_mmtf_files([\"1STP\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print some summary data about this structure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "*** STRUCTURE DATA ***\n",
      "Number of models : 1\n",
      "Number of chains : 3\n",
      "Number of groups : 206\n",
      "Number of atoms : 1001\n",
      "Number of bonds : 940\n",
      "\n"
     ]
    }
   ],
   "source": [
    "traverseStructureHierarchy.print_structure_data(asymmetric_unit.first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the biological assembly from the asymmetric unit\n",
    "Now, we use a flatMap operation to map an asymmetric unit to one or more biological assemblies. In the case of 1STP, there is only one biological assembly, which represents a tetramer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "bio_assembly = asymmetric_unit.flatMap(StructureToBioassembly())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'1STP-BioAssembly1'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bio_assembly.first()[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the biological assembly contains 4 copies of the asymmetric unit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "*** STRUCTURE DATA ***\n",
      "Number of models : 1\n",
      "Number of chains : 12\n",
      "Number of groups : 824\n",
      "Number of atoms : 4004\n",
      "Number of bonds : 3280\n",
      "\n"
     ]
    }
   ],
   "source": [
    "traverseStructureHierarchy.print_structure_data(bio_assembly.first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Shown below is the bioassembly for 1STP (tetramer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "8c131b4c15c8486c8b1942c56f29052f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "interactive(children=(IntSlider(value=0, continuous_update=False, description='Structure', max=0), Output()), …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "view_structure([\"1STP\"], bioAssembly=True);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark.stop()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}