{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2-JoiningDatasets\n", "This tutorial shows how to identify drug molecules in the PDB by joining two datasets:\n", "\n", "1. Drug information from DrugBank\n", "2. Ligand information from RCSB PDB" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "from mmtfPyspark.datasets import customReportService, drugBankDataset\n", "from mmtfPyspark.structureViewer import view_binding_site" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Configure Spark" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "spark = SparkSession.builder.appName(\"2-JoiningDatasets\").getOrCreate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download open DrugBank dataset\n", "Download a dataset of drugs from [DrugBank](https://www.drugbank.ca) and filter out any drugs that do not have an InChIKey. [InChIKeys](https://en.wikipedia.org/wiki/International_Chemical_Identifier) are unique identifiers for small molecules.\n", "\n", "DrugBank provides more [detailed datasets](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/datasets/drugBankDataset.py), e.g., a subset of approved drugs, but these require a DrugBank username and password. For this tutorial we use the open DrugBank dataset." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
| | DrugBankID | AccessionNumbers | Commonname | CAS | UNII | Synonyms | StandardInChIKey |
|---|---|---|---|---|---|---|---|
| 0 | DB00006 | BIOD00076 \| BTD00076 \| DB02351 \| EXPT03302 | Bivalirudin | 128270-60-0 | TN9BEX005G | Bivalirudin \| Bivalirudina \| Bivalirudinum | OIRCOABEOLEUMC-GEJPAHFPSA-N |
| 1 | DB00007 | BIOD00009 \| BTD00009 | Leuprolide | 53714-56-0 | EFY6W0M8TG | Leuprorelin \| Leuprorelina \| Leuproreline \| Le... | GFIJNRVAKGFPGQ-LIJARHBVSA-N |
| 2 | DB00014 | BIOD00113 \| BTD00113 | Goserelin | 65807-02-5 | 0F65R8P09N | Goserelin \| Goserelina | BLCLNMBMMGCOAS-URPVMXJPSA-N |
| 3 | DB00027 | BIOD00036 \| BTD00036 | Gramicidin D | 1405-97-6 | 5IE62321P4 | Bacillus brevis gramicidin D \| Gramicidin \| Gr... | NDAYQJDHGXTBJL-MWWSRJDJSA-N |
| 4 | DB00035 | BIOD00061 \| BIOD00112 \| BTD00061 \| BTD00112 | Desmopressin | 16679-58-6 | ENR1LLB0FP | 1-(3-mercaptopropionic acid)-8-D-arginine-vaso... | NFLWUMRGJYTJIN-PNIOQBSNSA-N |