{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "I work with a bunch of different datasets. One big challenge with this is keeping track of which dataset is which and where they all are. Maybe you recognize the problem. One standard solution to this is to use a data catalog, and the PyData ecosystem happens to have a nice tool for working with data catalogs, [intake](https://intake.readthedocs.io/en/latest/overview.html). A couple of years ago I wrote a [blog post about using intake to handle some chemistry file formats](http://rdkit.blogspot.com/2019/07/using-intake-for-chemistry-data.html) and that inspired me to write a small intake plugin which added support for working with SDF and SMILES files using intake.\n", "\n", "I've been meaning to do a blog post describing the plugin and how to use it, but somehow never got around to it. However, this week we published a [preprint](https://chemrxiv.org/engage/chemrxiv/article-details/6406049e6642bf8c8f10e189) that includes some datasets (Aside: the preprint introduces SIMPD, an algorithm for splitting chemical datasets into training and test sets which differ from each other in the same ways that time splits constructed from medicinal chemistry project data do.) with an intake data catalog. The data sets and intake description are included in the [GitHub repo for the paper].(https://github.com/rinikerlab/molecular_time_series/tree/main/datasets). That's provided the impetus for me to finally do a quick blog post describing how to use intake for chemistry datasets.\n", "\n", "This is just a quick one, to show the basics of working with intake and the RDKit, but hopefully it's enough to make you want to start using the system. Feedback on either this post or the RDKit plugin for intake are very welcome.\n", "\n", "\n", "# Getting started\n", "\n", "The RDKit plugin is in GitHub: https://github.com/greglandrum/intake-rdkit\n", "I haven't setup packages for it yet, but you can install it directly from the repo like this:\n", "```\n", "python -m pip install git+https://github.com/greglandrum/intake-rdkit.git\n", "```\n", "You do need to have both the `intake` and `rdkit` packages as well." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2023-03-09T05:59:35.927410Z", "start_time": "2023-03-09T05:59:35.911730Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022.09.5\n" ] } ], "source": [ "from rdkit import Chem\n", "from rdkit.Chem.Draw import IPythonConsole\n", "from rdkit.Chem import Draw\n", "from rdkit.Chem import PandasTools\n", "PandasTools.RenderImagesInAllDataFrames(True)\n", "import rdkit\n", "print(rdkit.__version__)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2023-03-09T05:38:41.624275Z", "start_time": "2023-03-09T05:38:41.027601Z" } }, "outputs": [], "source": [ "import intake\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Basics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start at the end: open the data catalog from github and then read one of the SIMPD datasets into a Pandas dataframe. Here I grab the data directly from the GitHub repo, but I could just as easily have worked from a local copy of the files:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2023-03-09T06:43:37.698716Z", "start_time": "2023-03-09T06:43:36.544297Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
molUnnamed: 0compound_chembl_idcanonical_smilesstandard_valuestandard_unitsstandard_relationstandard_typepchembl_valuepActivityactivesplit
0
\"Mol\"/
229CHEMBL1256953CN(C)c1cc2c(Nc3ccc4c(c3)ncn4Cc3ccccc3)ncnc2cn13162.28nM=IC505.55.51train
1
\"Mol\"/
398CHEMBL375270O=C(Nc1ccccc1)c1cc([N+](=O)[O-])ccc1Cl10000.00nM=IC505.05.00train
2
\"Mol\"/
280CHEMBL15192CC1(C)CCC2=C(O1)c1ccccc1C(=O)C2=O10000.00nM=IC505.05.00train
3
\"Mol\"/
422CHEMBL441618CN1CCc2ccccc2Cc2[nH]c3ccccc3c2CC15011.87nM=IC505.35.31train
4
\"Mol\"/
263CHEMBL1393CC(=O)S[C@@H]1CC2=CC(=O)CC[C@]2(C)[C@H]2CC[C@@...12589.25nM=IC504.94.90test
\n", "
" ], "text/plain": [ " mol Unnamed: 0 \\\n", "0 229 \n", "1 398 \n", "2 280 \n", "3 422 \n", "4 263 \n", "\n", " compound_chembl_id canonical_smiles \\\n", "0 CHEMBL1256953 CN(C)c1cc2c(Nc3ccc4c(c3)ncn4Cc3ccccc3)ncnc2cn1 \n", "1 CHEMBL375270 O=C(Nc1ccccc1)c1cc([N+](=O)[O-])ccc1Cl \n", "2 CHEMBL15192 CC1(C)CCC2=C(O1)c1ccccc1C(=O)C2=O \n", "3 CHEMBL441618 CN1CCc2ccccc2Cc2[nH]c3ccccc3c2CC1 \n", "4 CHEMBL1393 CC(=O)S[C@@H]1CC2=CC(=O)CC[C@]2(C)[C@H]2CC[C@@... \n", "\n", " standard_value standard_units standard_relation standard_type \\\n", "0 3162.28 nM = IC50 \n", "1 10000.00 nM = IC50 \n", "2 10000.00 nM = IC50 \n", "3 5011.87 nM = IC50 \n", "4 12589.25 nM = IC50 \n", "\n", " pchembl_value pActivity active split \n", "0 5.5 5.5 1 train \n", "1 5.0 5.0 0 train \n", "2 5.0 5.0 0 train \n", "3 5.3 5.3 1 train \n", "4 4.9 4.9 0 test " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rawbase = 'https://raw.githubusercontent.com/rinikerlab/molecular_time_series/main'\n", "catalog = intake.open_catalog(f'{rawbase}/datasets/public_data.yaml')\n", "ds = catalog.SIMPD.CHEMBL1267245\n", "df = ds.read()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a pandas `DataFrame`, so we can take advantage of all that they support. Here we retrieve rows that are marked as active and which have a aromatic five-ring containing an N:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2023-03-09T06:02:33.967161Z", "start_time": "2023-03-09T06:02:33.889037Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
molUnnamed: 0compound_chembl_idcanonical_smilesstandard_valuestandard_unitsstandard_relationstandard_typepchembl_valuepActivityactivesplit
3
\"Mol\"/
422CHEMBL441618CN1CCc2ccccc2Cc2[nH]c3ccccc3c2CC15011.87nM=IC505.305.3000001train
20
\"Mol\"/
39CHEMBL1200517CN1C[C@H](C(=O)N[C@]2(C)O[C@@]3(O)[C@@H]4CCCN4...2254.00nM=IC505.655.6470461train
26
\"Mol\"/
77CHEMBL123Cc1c2ccncc2c(C)c2c1[nH]c1ccccc120.56nM=IC509.259.2518121train
31
\"Mol\"/
490CHEMBL605003CN(C)S(=O)(=O)c1ccc2c(c1)/C(=C/c1cc3c([nH]1)CC...3981.07nM=IC505.405.4000001test
40
\"Mol\"/
507CHEMBL772COC(=O)[C@H]1[C@H]2C[C@@H]3c4[nH]c5cc(OC)ccc5c...1995.26nM=IC505.705.7000011train
\n", "
" ], "text/plain": [ " mol Unnamed: 0 \\\n", "3 422 \n", "20 39 \n", "26 77 \n", "31 490 \n", "40 507 \n", "\n", " compound_chembl_id canonical_smiles \\\n", "3 CHEMBL441618 CN1CCc2ccccc2Cc2[nH]c3ccccc3c2CC1 \n", "20 CHEMBL1200517 CN1C[C@H](C(=O)N[C@]2(C)O[C@@]3(O)[C@@H]4CCCN4... \n", "26 CHEMBL123 Cc1c2ccncc2c(C)c2c1[nH]c1ccccc12 \n", "31 CHEMBL605003 CN(C)S(=O)(=O)c1ccc2c(c1)/C(=C/c1cc3c([nH]1)CC... \n", "40 CHEMBL772 COC(=O)[C@H]1[C@H]2C[C@@H]3c4[nH]c5cc(OC)ccc5c... \n", "\n", " standard_value standard_units standard_relation standard_type \\\n", "3 5011.87 nM = IC50 \n", "20 2254.00 nM = IC50 \n", "26 0.56 nM = IC50 \n", "31 3981.07 nM = IC50 \n", "40 1995.26 nM = IC50 \n", "\n", " pchembl_value pActivity active split \n", "3 5.30 5.300000 1 train \n", "20 5.65 5.647046 1 train \n", "26 9.25 9.251812 1 train \n", "31 5.40 5.400000 1 test \n", "40 5.70 5.700001 1 train " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mols = df[(df.active==1) & (df.mol>=Chem.MolFromSmarts('c1cccn1'))]\n", "mols.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous example I directly accessed a particular dataset that I knew was there.\n", "\n", "I could have also looped over the available datasets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2023-03-09T06:49:39.922076Z", "start_time": "2023-03-09T06:49:39.674761Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CHEMBL1267247\n", "CHEMBL3705464\n", "CHEMBL3705282\n", "CHEMBL1267250\n", "CHEMBL3705791\n" ] } ], "source": [ "catalog = intake.open_catalog(f'{rawbase}/datasets/public_data.yaml')\n", "for i,assay in enumerate(catalog.SIMPD):\n", " print(assay)\n", " if i>=4: break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The individual datasets have metadata in the catalog:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2023-03-09T06:49:10.924672Z", "start_time": "2023-03-09T06:49:10.666573Z" } }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['url', 'source', 'activity_type', 'min pchembl_value', 'max pchembl_value', 'median pchembl_value', 'is_log_data', 'activity bin', 'Compounds', 'Num Active', 'Num Inactive', 'catalog_dir'])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "catalog.SIMPD[assay].metadata.keys()" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2023-03-09T05:39:02.440157Z", "start_time": "2023-03-09T05:39:02.429165Z" } }, "source": [ "And we can use that as a filter. Note that commands like this take a few seconds to run when accessing the catalog from github. Using a local copy is much faster." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2023-03-09T06:53:06.338603Z", "start_time": "2023-03-09T06:52:57.739455Z" } }, "outputs": [ { "data": { "text/plain": [ "['CHEMBL3705362', 'CHEMBL3705899', 'CHEMBL3706373']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "large_assays = [assay for assay in catalog.SIMPD if catalog.SIMPD[assay].metadata['Compounds']>700]\n", "large_assays" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2023-03-09T06:54:31.358144Z", "start_time": "2023-03-09T06:54:30.804041Z" } }, "outputs": [ { "data": { "text/plain": [ "{'url': 'https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL3705899/',\n", " 'source': 'ChEMBL_30',\n", " 'activity_type': 'IC50',\n", " 'min pchembl_value': 4.61,\n", " 'max pchembl_value': 9.0,\n", " 'median pchembl_value': 6.96,\n", " 'is_log_data': True,\n", " 'activity bin': 7.2,\n", " 'Compounds': 766,\n", " 'Num Active': 304,\n", " 'Num Inactive': 462,\n", " 'catalog_dir': 'https://raw.githubusercontent.com/rinikerlab/molecular_time_series/main/datasets'}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "catalog.SIMPD['CHEMBL3705899'].metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course the catalogs themselves also have metadata:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2023-03-09T06:56:38.800608Z", "start_time": "2023-03-09T06:56:38.797032Z" } }, "outputs": [ { "data": { "text/plain": [ "{'version': 1,\n", " 'creator': {'name': 'greg landrum', 'email': 'glandrum@ethz.ch'},\n", " 'summary': 'Collection of datasets from the publication\\nG.A. Landrum, M. Beckers, J. Lanini, N. Schneider, N. Stiefl, S. Riniker \\n\"SIMPD: an Algorithm for Generating Simulated Time Splits for Validating Machine Learning Approaches\"\\nhttps://chemrxiv.org/engage/chemrxiv/article-details/6406049e6642bf8c8f10e189\\n\\nPlease cite our paper if you use these datasets. \\n'}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "catalog.metadata" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2023-03-09T06:56:48.795639Z", "start_time": "2023-03-09T06:56:48.548899Z" } }, "outputs": [ { "data": { "text/plain": [ "{'version': 1,\n", " 'creator': {'name': 'greg landrum', 'email': 'glandrum@ethz.ch'},\n", " 'summary': 'Collection of datasets with pchembl_values for bioactivity prediction.\\n\\nEach row includes the reported value. Only values without data_validity_comments are included\\nActive/inactive class assignments were done to give a 40/60 inactive/active ratio\\nThe suggested train/test split was created using the SIMPD algorithm, described in the publication\\nhttps://chemrxiv.org/engage/chemrxiv/article-details/6406049e6642bf8c8f10e189\\n',\n", " 'split_column': 'split',\n", " 'class_column': 'active',\n", " 'activity_column': 'standard_value'}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "catalog.SIMPD.metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Supported formats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The RDKit intake integration currently supports two formats:\n", "\n", " - SDF\n", " - SMILES: this can be any delimited text file which includes a column with SMILES\n", " \n", " Either format can be read from either compressed or uncompressed files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating data descriptions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Catalogs and datasets in intake are described in YAML files. You can, of course, create these by hand, but it's certainly more efficient to create them programmatically. Here's the code I used to do this for the raw ChEMBL assays in our preprint, derived from one of the notebooks in the [GitHub repo](https://github.com/rinikerlab/molecular_time_series/blob/main/01_Get_ChEMBL30_Bioactivity_Data_assays.ipynb):\n", "\n", "```\n", "yaml=[]\n", "# loop over assay IDs:\n", "for aid,achemblid in zip(aids,achemblids):\n", " # retrieve the assay data from ChEMBL:\n", " d = %sql postgresql://localhost/chembl_30 \\\n", "select distinct on (cil.chembl_id) cil.chembl_id compound_chembl_id,canonical_smiles, \\\n", " standard_value,standard_units,standard_relation,standard_type,pchembl_value \\\n", " from activities join assays using(assay_id) \\\n", " join compound_structures using(molregno) \\\n", " join chembl_id_lookup cil on (molregno=entity_id and entity_type='COMPOUND') \\\n", " where pchembl_value is not null \\\n", " and standard_value is not null and standard_units = 'nM' \\\n", " and data_validity_comment is null \\\n", " and assay_id=:aid \\\n", " order by cil.chembl_id;\n", " # convert that to a pandas dataframe and write a CSV file:\n", " df = d.DataFrame()\n", " df.to_csv(f'./datasets/source_data/assay_{achemblid}.csv.gz',index=False)\n", " \n", " # generate the metadata we care about:\n", " minAct = min(df.pchembl_value)\n", " maxAct = max(df.pchembl_value)\n", " medAct = np.median(df.pchembl_value)\n", " actTypes = list(set(df.standard_type))\n", " if len(actTypes)>1:\n", " print(f'SKIPPING {achemblid} since it has {len(actTypes)} different activity types.')\n", " continue\n", " actType = actTypes[0]\n", " assayd = %sql postgresql://localhost/chembl_30 \\\n", " select * from assays \\\n", " where assay_id=:aid\n", " assayd = dict(assayd[0])\n", " # quotes in the description cause problems with the yaml parsing\n", " assayd['description'] = assayd['description'].replace('\"','')\n", " \n", " # create the YAML for this dataset:\n", " template=f''' {achemblid}:\n", " description: \"Assay {achemblid}: {assayd['description']}\"\n", " args:\n", " filename: '{{{{ CATALOG_DIR }}}}/source_data/assay_{achemblid}.csv.gz'\n", " smilesColumn: canonical_smiles\n", " metadata:\n", " url: https://www.ebi.ac.uk/chembl/assay_report_card/{achemblid}/\n", " source: ChEMBL_30\n", " activity_type: {actType}\n", " min pchembl_value: {minAct:.2f}\n", " max pchembl_value: {maxAct:.2f}\n", " median pchembl_value: {medAct:.2f}\n", " driver: intake_rdkit.smiles.SmilesSource\n", "'''\n", " yaml.append(template)\n", "\n", "# print the YAML to the notebook\n", "print('\\n'.join(yaml))\n", "\n", "# and write the full data catalog file:\n", "with open('./datasets/assays.yaml','w+') as outf:\n", " header='''metadata:\n", " version: 1\n", " creator: \n", " name: greg landrum\n", " email: glandrum@ethz.ch\n", "\n", " summary: |\n", " Collection of datasets with pchembl_values for bioactivity prediction.\n", "\n", " Each row includes the reported value. Only values without data_validity_comments are included\n", " \n", "sources:'''\n", " print(header,file=outf)\n", " print('\\n'.join(yaml),file=outf)```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }