{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2-JoiningDatasets\n",
    "This tutorial shows how to identify drug molecules in the PDB by joining two datasets: \n",
    "\n",
    "1. Drug information from DrugBank\n",
    "2. Ligand information from RCSB PDB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "from mmtfPyspark.datasets import customReportService, drugBankDataset\n",
    "from mmtfPyspark.structureViewer import view_binding_site"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Configure Spark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark = SparkSession.builder.appName(\"2-JoiningDatasets\").getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download open DrugBank dataset\n",
    "Download a dataset of drugs from [DrugBank](https://www.drugbank.ca) and filter out any drugs that do not have an InChIKey. [InChIKeys](https://en.wikipedia.org/wiki/International_Chemical_Identifier) are unique identifiers for small molecules. \n",
    "\n",
    "DrugBank provides more [detailed datasets](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/datasets/drugBankDataset.py), e.g., subset of approved drugs, but a DrugBank username and password is required. For this tutorial we use the open DrugBank dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>DrugBankID</th>\n",
       "      <th>AccessionNumbers</th>\n",
       "      <th>Commonname</th>\n",
       "      <th>CAS</th>\n",
       "      <th>UNII</th>\n",
       "      <th>Synonyms</th>\n",
       "      <th>StandardInChIKey</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DB00006</td>\n",
       "      <td>BIOD00076 | BTD00076 | DB02351 | EXPT03302</td>\n",
       "      <td>Bivalirudin</td>\n",
       "      <td>128270-60-0</td>\n",
       "      <td>TN9BEX005G</td>\n",
       "      <td>Bivalirudin | Bivalirudina | Bivalirudinum</td>\n",
       "      <td>OIRCOABEOLEUMC-GEJPAHFPSA-N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>DB00007</td>\n",
       "      <td>BIOD00009 | BTD00009</td>\n",
       "      <td>Leuprolide</td>\n",
       "      <td>53714-56-0</td>\n",
       "      <td>EFY6W0M8TG</td>\n",
       "      <td>Leuprorelin | Leuprorelina | Leuproreline | Le...</td>\n",
       "      <td>GFIJNRVAKGFPGQ-LIJARHBVSA-N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>DB00014</td>\n",
       "      <td>BIOD00113 | BTD00113</td>\n",
       "      <td>Goserelin</td>\n",
       "      <td>65807-02-5</td>\n",
       "      <td>0F65R8P09N</td>\n",
       "      <td>Goserelin | Goserelina</td>\n",
       "      <td>BLCLNMBMMGCOAS-URPVMXJPSA-N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>DB00027</td>\n",
       "      <td>BIOD00036 | BTD00036</td>\n",
       "      <td>Gramicidin D</td>\n",
       "      <td>1405-97-6</td>\n",
       "      <td>5IE62321P4</td>\n",
       "      <td>Bacillus brevis gramicidin D | Gramicidin | Gr...</td>\n",
       "      <td>NDAYQJDHGXTBJL-MWWSRJDJSA-N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>DB00035</td>\n",
       "      <td>BIOD00061 | BIOD00112 | BTD00061 | BTD00112</td>\n",
       "      <td>Desmopressin</td>\n",
       "      <td>16679-58-6</td>\n",
       "      <td>ENR1LLB0FP</td>\n",
       "      <td>1-(3-mercaptopropionic acid)-8-D-arginine-vaso...</td>\n",
       "      <td>NFLWUMRGJYTJIN-PNIOQBSNSA-N</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  DrugBankID                             AccessionNumbers    Commonname  \\\n",
       "0    DB00006   BIOD00076 | BTD00076 | DB02351 | EXPT03302   Bivalirudin   \n",
       "1    DB00007                         BIOD00009 | BTD00009    Leuprolide   \n",
       "2    DB00014                         BIOD00113 | BTD00113     Goserelin   \n",
       "3    DB00027                         BIOD00036 | BTD00036  Gramicidin D   \n",
       "4    DB00035  BIOD00061 | BIOD00112 | BTD00061 | BTD00112  Desmopressin   \n",
       "\n",
       "           CAS        UNII                                           Synonyms  \\\n",
       "0  128270-60-0  TN9BEX005G         Bivalirudin | Bivalirudina | Bivalirudinum   \n",
       "1   53714-56-0  EFY6W0M8TG  Leuprorelin | Leuprorelina | Leuproreline | Le...   \n",
       "2   65807-02-5  0F65R8P09N                             Goserelin | Goserelina   \n",
       "3    1405-97-6  5IE62321P4  Bacillus brevis gramicidin D | Gramicidin | Gr...   \n",
       "4   16679-58-6  ENR1LLB0FP  1-(3-mercaptopropionic acid)-8-D-arginine-vaso...   \n",
       "\n",
       "              StandardInChIKey  \n",
       "0  OIRCOABEOLEUMC-GEJPAHFPSA-N  \n",
       "1  GFIJNRVAKGFPGQ-LIJARHBVSA-N  \n",
       "2  BLCLNMBMMGCOAS-URPVMXJPSA-N  \n",
       "3  NDAYQJDHGXTBJL-MWWSRJDJSA-N  \n",
       "4  NFLWUMRGJYTJIN-PNIOQBSNSA-N  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drugs = drugBankDataset.get_open_drug_links()\n",
    "drugs = drugs.filter(\"StandardInChIKey IS NOT NULL\").cache()\n",
    "drugs.toPandas().head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NOTE: RCSB PDB web services have been depreciated!\n",
    "New functionality needs to be developed to enable this example."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download ligand annotations from RCSB PDB \n",
    "Here we use [RCSB PDB web services](http://dx.doi.org/10.1093/nar/gkq1021) to download InChIKeys and molecular weight for ligands in the PDB (this step can be slow!).\n",
    "\n",
    "We filter out entries without an InChIKey and low molecular weight ligands using SQL syntax."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ligands = customReportService.get_dataset([\"ligandId\",\"InChIKey\",\"ligandMolecularWeight\"])\n",
    "\n",
    "# ligands = ligands.filter(\"InChIKey IS NOT NULL AND ligandMolecularWeight > 300\").cache()\n",
    "\n",
    "# ligands.toPandas().head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find drugs in PDB\n",
    "By [joining](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=join#pyspark.sql.DataFrame.join) the two datasets on the InChIKey, we get the intersection between the two datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ligands = ligands.join(drugs, ligands.InChIKey == drugs.StandardInChIKey)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Keep only unique ligands per structure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we [drop](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=join#pyspark.sql.DataFrame.dropDuplicates) rows with the same structureId and ligandId."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ligands = ligands.dropDuplicates([\"structureId\",\"ligandId\"]).cache()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Keep only essential columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ligands = ligands.select(\"structureId\",\"ligandId\",\"chainId\",\"Commonname\")\n",
    "# ligands.toPandas().head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize drug binding sites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Extract id columns as lists (required for visualization)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pdb_ids = ligands.select(\"structureId\").rdd.flatMap(lambda x: x).collect()\n",
    "# ligand_ids = ligands.select(\"ligandId\").rdd.flatMap(lambda x: x).collect()\n",
    "# chain_ids = ligands.select(\"chainId\").rdd.flatMap(lambda x: x).collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Disable scrollbar for the visualization below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "#%%javascript \n",
    "#IPython.OutputArea.prototype._should_scroll = function(lines) {return false;}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Show binding site residues within 4.5 A from the drug molecule"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# view_binding_site(pdb_ids, ligand_ids, chain_ids, distance=4.5);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark.stop()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}