{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2-JoiningDatasets\n", "This tutorial shows how to identify drug molecules in the PDB by joining two datasets: \n", "\n", "1. Drug information from DrugBank\n", "2. Ligand information from RCSB PDB" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "from mmtfPyspark.datasets import customReportService, drugBankDataset\n", "from mmtfPyspark.structureViewer import view_binding_site" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Configure Spark" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "spark = SparkSession.builder.appName(\"2-JoiningDatasets\").getOrCreate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download open DrugBank dataset\n", "Download a dataset of drugs from [DrugBank](https://www.drugbank.ca) and filter out any drugs that do not have an InChIKey. [InChIKeys](https://en.wikipedia.org/wiki/International_Chemical_Identifier) are compact, hashed identifiers for small molecules, which makes them convenient keys for joining datasets. \n", "\n", "DrugBank provides more [detailed datasets](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/datasets/drugBankDataset.py), e.g., a subset of approved drugs, but these require a DrugBank username and password. For this tutorial we use the open DrugBank dataset." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>DrugBankID</th><th>AccessionNumbers</th><th>Commonname</th><th>CAS</th><th>UNII</th><th>Synonyms</th><th>StandardInChIKey</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>DB00006</td><td>BIOD00076 | BTD00076 | DB02351 | EXPT03302</td><td>Bivalirudin</td><td>128270-60-0</td><td>TN9BEX005G</td><td>Bivalirudin | Bivalirudina | Bivalirudinum</td><td>OIRCOABEOLEUMC-GEJPAHFPSA-N</td></tr>\n",
"    <tr><th>1</th><td>DB00007</td><td>BIOD00009 | BTD00009</td><td>Leuprolide</td><td>53714-56-0</td><td>EFY6W0M8TG</td><td>Leuprorelin | Leuprorelina | Leuproreline | Le...</td><td>GFIJNRVAKGFPGQ-LIJARHBVSA-N</td></tr>\n",
"    <tr><th>2</th><td>DB00014</td><td>BIOD00113 | BTD00113</td><td>Goserelin</td><td>65807-02-5</td><td>0F65R8P09N</td><td>Goserelin | Goserelina</td><td>BLCLNMBMMGCOAS-URPVMXJPSA-N</td></tr>\n",
"    <tr><th>3</th><td>DB00027</td><td>BIOD00036 | BTD00036</td><td>Gramicidin D</td><td>1405-97-6</td><td>5IE62321P4</td><td>Bacillus brevis gramicidin D | Gramicidin | Gr...</td><td>NDAYQJDHGXTBJL-MWWSRJDJSA-N</td></tr>\n",
"    <tr><th>4</th><td>DB00035</td><td>BIOD00061 | BIOD00112 | BTD00061 | BTD00112</td><td>Desmopressin</td><td>16679-58-6</td><td>ENR1LLB0FP</td><td>1-(3-mercaptopropionic acid)-8-D-arginine-vaso...</td><td>NFLWUMRGJYTJIN-PNIOQBSNSA-N</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " DrugBankID AccessionNumbers Commonname \\\n", "0 DB00006 BIOD00076 | BTD00076 | DB02351 | EXPT03302 Bivalirudin \n", "1 DB00007 BIOD00009 | BTD00009 Leuprolide \n", "2 DB00014 BIOD00113 | BTD00113 Goserelin \n", "3 DB00027 BIOD00036 | BTD00036 Gramicidin D \n", "4 DB00035 BIOD00061 | BIOD00112 | BTD00061 | BTD00112 Desmopressin \n", "\n", " CAS UNII Synonyms \\\n", "0 128270-60-0 TN9BEX005G Bivalirudin | Bivalirudina | Bivalirudinum \n", "1 53714-56-0 EFY6W0M8TG Leuprorelin | Leuprorelina | Leuproreline | Le... \n", "2 65807-02-5 0F65R8P09N Goserelin | Goserelina \n", "3 1405-97-6 5IE62321P4 Bacillus brevis gramicidin D | Gramicidin | Gr... \n", "4 16679-58-6 ENR1LLB0FP 1-(3-mercaptopropionic acid)-8-D-arginine-vaso... \n", "\n", " StandardInChIKey \n", "0 OIRCOABEOLEUMC-GEJPAHFPSA-N \n", "1 GFIJNRVAKGFPGQ-LIJARHBVSA-N \n", "2 BLCLNMBMMGCOAS-URPVMXJPSA-N \n", "3 NDAYQJDHGXTBJL-MWWSRJDJSA-N \n", "4 NFLWUMRGJYTJIN-PNIOQBSNSA-N " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "drugs = drugBankDataset.get_open_drug_links()\n", "drugs = drugs.filter(\"StandardInChIKey IS NOT NULL\").cache()\n", "drugs.toPandas().head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# NOTE: RCSB PDB web services have been deprecated!\n", "The cells below that depend on these services are commented out. New functionality needs to be developed to re-enable this example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download ligand annotations from RCSB PDB \n", "Here we use [RCSB PDB web services](http://dx.doi.org/10.1093/nar/gkq1021) to download InChIKeys and molecular weights for ligands in the PDB (this step can be slow!).\n", "\n", "Using SQL syntax, we filter out entries without an InChIKey as well as low-molecular-weight ligands (molecular weight of 300 or less)."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# ligands = customReportService.get_dataset([\"ligandId\",\"InChIKey\",\"ligandMolecularWeight\"])\n", "\n", "# ligands = ligands.filter(\"InChIKey IS NOT NULL AND ligandMolecularWeight > 300\").cache()\n", "\n", "# ligands.toPandas().head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find drugs in PDB\n", "By [joining](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=join#pyspark.sql.DataFrame.join) the two datasets on the InChIKey (an inner join by default), we keep only the PDB ligands that also appear in DrugBank." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# ligands = ligands.join(drugs, ligands.InChIKey == drugs.StandardInChIKey)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Keep only unique ligands per structure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we [drop](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=join#pyspark.sql.DataFrame.dropDuplicates) duplicate rows that share the same structureId and ligandId, keeping one row per pair."
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# ligands = ligands.dropDuplicates([\"structureId\",\"ligandId\"]).cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Keep only essential columns" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# ligands = ligands.select(\"structureId\",\"ligandId\",\"chainId\",\"Commonname\")\n", "# ligands.toPandas().head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize drug binding sites" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract ID columns as lists (required for visualization)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# pdb_ids = ligands.select(\"structureId\").rdd.flatMap(lambda x: x).collect()\n", "# ligand_ids = ligands.select(\"ligandId\").rdd.flatMap(lambda x: x).collect()\n", "# chain_ids = ligands.select(\"chainId\").rdd.flatMap(lambda x: x).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Disable the scrollbar for the visualization below" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "#%%javascript \n", "#IPython.OutputArea.prototype._should_scroll = function(lines) {return false;}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Show binding site residues within 4.5 Å of the drug molecule" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# view_binding_site(pdb_ids, ligand_ids, chain_ids, distance=4.5);" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "spark.stop()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { 
"display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }