{ "cells": [ { "cell_type": "markdown", "id": "d0c5f0f3", "metadata": {}, "source": [ "This post looks at two things:\n", "\n", "1. Supplying your own parameters when normalizing molecules using the RDKit's `MolStandardize`\n", "2. Capturing and parsing the RDKit's C++ logs from within Python\n", "\n", "The RDKit's `MolStandardize` module includes functionality for \"normalizing\" input molecules: applying a set of chemical transformations expressed as reactions to standardize the representation of particular functional groups or substructures of the input molecule. This kind of normalization is a common step in preparing molecules for a compound registration system. `MolStandardize` has a collection of default transformations adapted from Matt Swain's [MolVS](https://molvs.readthedocs.io/en/latest/) (`MolStandardize` itself started as a C++ port of MolVS) which cover standardizing a variety of common functional groups. The full list is in the [RDKit source](https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/MolStandardize/TransformCatalog/normalizations.in).\n", "\n", "Though we have tried to provide a broad and useful set of transformations, we know that they won't fit everybody's needs, so `MolStandardize` allows the user to provide their own transformations. The first part of this post demonstrates how to do that.\n", "\n", "The normalization functionality in `MolStandardize` sends information about what transformations have been applied to the console using the RDKit's logging functionality. This can be useful when working with small numbers of molecules in the shell or a notebook, and when debugging new transformations, but quickly becomes irritating when working with larger sets of molecules. 
In the second part of the post I'll show how to disable this logging information as well as how to capture it and parse the logs to find out which transformations were applied to each molecule.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "a6ccbee4", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T04:54:00.013416Z", "start_time": "2024-02-23T04:53:58.694409Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2023.09.5\n" ] } ], "source": [ "from rdkit import Chem\n", "from rdkit.Chem.MolStandardize import rdMolStandardize\n", "from rdkit.Chem import Draw\n", "from rdkit.Chem.Draw import IPythonConsole\n", "\n", "import rdkit\n", "print(rdkit.__version__)" ] }, { "cell_type": "markdown", "id": "1db0c5ad", "metadata": {}, "source": [ "# Providing your own normalization transformations" ] }, { "cell_type": "markdown", "id": "9a236c0c", "metadata": {}, "source": [ "Let's start with a simple molecule to demonstrate the default behavior of the normalization code.\n", "\n", "The easiest way to apply the defaults to a molecule is to use the function `rdMolStandardize.Normalize()`:" ] }, { "cell_type": "code", "execution_count": 2, "id": "00fee046", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T04:54:00.107390Z", "start_time": "2024-02-23T04:54:00.061640Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[05:54:00] Initializing Normalizer\n", "[05:54:00] Running Normalizer\n", "[05:54:00] Rule applied: Recombine 1,3-separated charges\n" ] }, { "data": { "image/png": "", "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m = Chem.MolFromSmiles('[O-]c1[n+](C)cccc1')\n", "m2 = rdMolStandardize.Normalize(m)\n", "Draw.MolsToGridImage([m,m2])" ] }, { "cell_type": "markdown", "id": "d279fa3b", "metadata": {}, "source": [ "An aside: as was mentioned in [another recent blog 
post](https://greglandrum.github.io/rdkit-blog/posts/2024-02-11-more-multithreading.html), it's possible to call `NormalizeInPlace()` to modify the molecule in place instead of making a copy and working on it. This will be a bit faster and can be used with multiple threads when working with more than one molecule." ] }, { "cell_type": "markdown", "id": "4869bd72", "metadata": {}, "source": [ "We can accomplish the same thing by creating a `Normalizer` object and using it to normalize the molecule:" ] }, { "cell_type": "code", "execution_count": 3, "id": "6978c1f3", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T04:54:01.777584Z", "start_time": "2024-02-23T04:54:01.768867Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[05:54:01] Initializing Normalizer\n", "[05:54:01] Running Normalizer\n", "[05:54:01] Rule applied: Recombine 1,3-separated charges\n" ] }, { "data": { "image/png": "", "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nrm = rdMolStandardize.Normalizer()\n", "m2 = nrm.normalize(m)\n", "Draw.MolsToGridImage([m,m2])" ] }, { "cell_type": "markdown", "id": "77833cba", "metadata": {}, "source": [ "Now let's look at another molecule where the normalization doesn't do anything:" ] }, { "cell_type": "code", "execution_count": 4, "id": "1c22d99a", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T04:54:02.873397Z", "start_time": "2024-02-23T04:54:02.862326Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[05:54:02] Running Normalizer\n" ] }, { "data": { "image/png": "", "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m = Chem.MolFromSmiles('CC(C)(C)OC(=O)NCCC(=O)O[Na]')\n", "m2 = nrm.normalize(m)\n", "Draw.MolsToGridImage([m,m2])" ] }, { "cell_type": "markdown", "id": "503eae34", "metadata": {}, "source": [ "As a demonstration, let's construct a normalizer which will 
break the bond to the alkali metal and remove the Boc protecting group. This is, of course, an artificial example, but it shows how to provide new transformations to `MolStandardize`." ] }, { "cell_type": "code", "execution_count": 5, "id": "d1e6c4dc", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T04:54:08.449768Z", "start_time": "2024-02-23T04:54:08.440610Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[05:54:08] Initializing Normalizer\n", "[05:54:08] Running Normalizer\n", "[05:54:08] Rule applied: remove_Boc\n", "[05:54:08] Rule applied: disconnect_alkali_metals\n" ] }, { "data": { "image/png": "", "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We provide the transformations in a string with one line per transformation.\n", "# Lines starting with // are comments and are ignored. Each line contains the\n", "# name of the transformation, a tab character, and then the reaction SMARTS for\n", "# the transformation itself. It's also possible to skip the names; in that case\n", "# just provide the reaction SMARTS (the tab is not needed).\n", "tfs = '''\n", "// this is from Deprotect.cpp:\n", "remove_Boc\\t[C;R0][C;R0]([C;R0])([O;R0][C;R0](=[O;R0])[NX3;H0,H1:1])C>>[N:1]\n", "// this should go last, because later transformations will\n", "// lose the alkali metal\n", "disconnect_alkali_metals\\t[Li,Na,K,Rb:1]-[A:2]>>([*+:1].[*-:2])\n", "'''\n", "# create the new Normalizer:\n", "cps = rdMolStandardize.CleanupParameters()\n", "nrm2 = rdMolStandardize.NormalizerFromData(tfs,cps)\n", "\n", "# and apply it:\n", "m2 = nrm2.normalize(m)\n", "Draw.MolsToGridImage([m,m2])" ] }, { "cell_type": "markdown", "id": "4e7cf098", "metadata": {}, "source": [ "If you're processing a lot of molecules, you probably don't want those log messages filling your console or notebook. 
It's possible to disable the logging:" ] }, { "cell_type": "code", "execution_count": 7, "id": "6caa1f97", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T04:55:18.810931Z", "start_time": "2024-02-23T04:55:18.808343Z" } }, "outputs": [], "source": [ "from rdkit import RDLogger\n", "RDLogger.DisableLog('rdApp.info')\n", "m2 = nrm2.normalize(m)\n" ] }, { "cell_type": "markdown", "id": "b058feb5", "metadata": {}, "source": [ "# Connecting the RDKit logs to the Python logger" ] }, { "cell_type": "markdown", "id": "f14a635b", "metadata": {}, "source": [ "It's also possible to route the log messages from the RDKit's C++ backend to Python's built-in logging facilities. This has been available since the 2022.03.1 release of the RDKit.\n", "\n", "I won't do a complete introduction to Python's logging capabilities here (I'm not an expert!), but here's a quick demonstration of what we might do:" ] }, { "cell_type": "code", "execution_count": null, "id": "f9a78506", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 19, "id": "d06c5bd9", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T05:18:04.143889Z", "start_time": "2024-02-23T05:18:04.141575Z" } }, "outputs": [], "source": [ "import logging\n", "\n", "logger = logging.getLogger('rdkit')\n", "# set the log level for the default log handler (the one which sends output to the console/notebook):\n", "logger.handlers[0].setLevel(logging.WARN)\n", "# format the messages so that it's clear they come from the RDKit:\n", "logger.handlers[0].setFormatter(logging.Formatter('[RDKit] %(levelname)s:%(message)s'))\n", "\n", "# Tell the RDKit's C++ backend to log using the Python logger:\n", "from rdkit import rdBase\n", "rdBase.LogToPythonLogger()\n" ] }, { "cell_type": "markdown", "id": "07a634a0", "metadata": {}, "source": [ "Now do something which generates an error message:" ] }, { "cell_type": "code", "execution_count": 20, "id": "2a683682", "metadata": { "ExecuteTime": { 
"end_time": "2024-02-23T05:18:05.440976Z", "start_time": "2024-02-23T05:18:05.433570Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[RDKit] ERROR:[06:18:05] SMILES Parse Error: unclosed ring for input: 'C1C'\n" ] } ], "source": [ "Chem.MolFromSmiles('C1C')" ] }, { "cell_type": "markdown", "id": "14b60a95", "metadata": {}, "source": [ "You can tell that this went through the Python logger because it has the prefix we set up above." ] }, { "cell_type": "markdown", "id": "21f37a19", "metadata": {}, "source": [ "Since we're only logging at the WARN (and above) level, the INFO messages from doing normalizations don't appear:" ] }, { "cell_type": "code", "execution_count": 21, "id": "02b60d99", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T05:18:07.039594Z", "start_time": "2024-02-23T05:18:07.037451Z" } }, "outputs": [], "source": [ "m2 = nrm2.normalize(m)" ] }, { "cell_type": "markdown", "id": "36581e55", "metadata": {}, "source": [ "Let's create an additional log message handler that catches INFO messages too but that saves them in a `StringIO` object instead of outputting them to the console:" ] }, { "cell_type": "code", "execution_count": 38, "id": "bb1924f7", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T05:20:43.872130Z", "start_time": "2024-02-23T05:20:43.866249Z" } }, "outputs": [], "source": [ "from io import StringIO\n", "\n", "logger_sio = StringIO()\n", "# create a handler that uses the StringIO and set its log level:\n", "handler = logging.StreamHandler(logger_sio)\n", "handler.setLevel(logging.INFO)\n", "\n", "# add the handler to the Python logger:\n", "logger.addHandler(handler)\n", "\n", "# we also need to change the level of the main logger so that the INFO messages get sent to the handlers:\n", "logger.setLevel(logging.INFO)" ] }, { "cell_type": "markdown", "id": "4366b070", "metadata": {}, "source": [ "Now normalize a molecule:" ] }, { "cell_type": "code", "execution_count": 39, "id": "5de7716e", 
"metadata": { "ExecuteTime": { "end_time": "2024-02-23T05:20:44.775548Z", "start_time": "2024-02-23T05:20:44.353105Z" } }, "outputs": [], "source": [ "m2 = nrm2.normalize(m)" ] }, { "cell_type": "markdown", "id": "15644286", "metadata": {}, "source": [ "We don't see anything in the notebook because the main handler (the one which goes to the console/notebook) is still set to only show WARN and above, but `logger_sio` has the message:" ] }, { "cell_type": "code", "execution_count": 40, "id": "777aeb44", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T05:20:44.900994Z", "start_time": "2024-02-23T05:20:44.898027Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[06:20:44] Running Normalizer\n", "[06:20:44] Rule applied: remove_Boc\n", "[06:20:44] Rule applied: disconnect_alkali_metals\n", "\n" ] } ], "source": [ "text = logger_sio.getvalue()\n", "print(text)" ] }, { "cell_type": "markdown", "id": "c62efa41", "metadata": {}, "source": [ "We can use a regular expression to extract the rules that were applied to the molecule from this output:" ] }, { "cell_type": "code", "execution_count": 41, "id": "62d1f102", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T05:20:45.801092Z", "start_time": "2024-02-23T05:20:45.798300Z" } }, "outputs": [ { "data": { "text/plain": [ "['remove_Boc', 'disconnect_alkali_metals']" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "rules = re.findall(r'Rule applied: (.*?)\\n',text)\n", "rules" ] }, { "cell_type": "markdown", "id": "faaff85f", "metadata": {}, "source": [ "If we now sanitize another molecule (or, in this case, the same one again), the log messages are appended to `logger_sio`:" ] }, { "cell_type": "code", "execution_count": 42, "id": "a4db1213", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T14:22:05.051809Z", "start_time": "2024-02-23T14:22:05.048739Z" } }, "outputs": [], "source": [ "m2 = nrm2.normalize(m)" ] }, { 
"cell_type": "code", "execution_count": 43, "id": "5721e131", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T14:22:12.047720Z", "start_time": "2024-02-23T14:22:12.040701Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[06:20:44] Running Normalizer\n", "[06:20:44] Rule applied: remove_Boc\n", "[06:20:44] Rule applied: disconnect_alkali_metals\n", "[15:22:05] Running Normalizer\n", "[15:22:05] Rule applied: remove_Boc\n", "[15:22:05] Rule applied: disconnect_alkali_metals\n", "\n" ] } ], "source": [ "text = logger_sio.getvalue()\n", "print(text)" ] }, { "cell_type": "markdown", "id": "cf5f6fa0", "metadata": {}, "source": [ "But we can reset the StringIO object:" ] }, { "cell_type": "code", "execution_count": 45, "id": "74d6d1e8", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T14:23:19.703702Z", "start_time": "2024-02-23T14:23:19.700393Z" } }, "outputs": [ { "data": { "text/plain": [ "['disconnect_alkali_metals']" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logger_sio.truncate(0)\n", "logger_sio.seek(0)\n", "\n", "# Create and normalize a new molecule:\n", "new_m = Chem.MolFromSmiles('CCCO[Na]')\n", "new_m2 = nrm2.normalize(new_m)\n", "\n", "# show the normalizations that were applied:\n", "re.findall(r'Rule applied: (.*?)\\n',logger_sio.getvalue())" ] }, { "cell_type": "markdown", "id": "b687b38c", "metadata": {}, "source": [ "We can now combine everything to write a function which applies a collection of custom transformations to a list of molecules and returns the transformed molecules along with which transformations were applied to each molecule:" ] }, { "cell_type": "code", "execution_count": 46, "id": "80d8aa63", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T14:30:51.403809Z", "start_time": "2024-02-23T14:30:51.400558Z" } }, "outputs": [], "source": [ "import logging\n", "from io import StringIO\n", "from rdkit import rdBase\n", "import re\n", "\n", "def 
normalize_molecules(mols,standardization_rules):\n", "    # Tell the RDKit's C++ backend to log using the Python logger:\n", "    rdBase.LogToPythonLogger()\n", "\n", "    logger = logging.getLogger('rdkit')\n", "    # set the log level for the default log handler (the one which sends output to the console/notebook):\n", "    logger.handlers[0].setLevel(logging.WARN)\n", "    # make sure that INFO messages get passed on to the handlers:\n", "    logger.setLevel(logging.INFO)\n", "\n", "    logger_sio = StringIO()\n", "    # create a handler that uses the StringIO and set its log level:\n", "    handler = logging.StreamHandler(logger_sio)\n", "    handler.setLevel(logging.INFO)\n", "    # add the handler to the Python logger:\n", "    logger.addHandler(handler)\n", "\n", "    # create the new Normalizer:\n", "    cps = rdMolStandardize.CleanupParameters()\n", "    nrm = rdMolStandardize.NormalizerFromData(standardization_rules,cps)\n", "\n", "    match_expr = re.compile(r'Rule applied: (.*?)\\n')\n", "\n", "    res_mols = []\n", "    tfs_applied = []\n", "    for mol in mols:\n", "        mol = nrm.normalize(mol)\n", "        res_mols.append(mol)\n", "\n", "        # collect the rules logged for this molecule, then reset the buffer:\n", "        text = logger_sio.getvalue()\n", "        tfs_applied.append(match_expr.findall(text))\n", "        logger_sio.truncate(0)\n", "        logger_sio.seek(0)\n", "\n", "    # remove our handler so that repeated calls don't accumulate handlers:\n", "    logger.removeHandler(handler)\n", "    return res_mols,tfs_applied\n" ] }, { "cell_type": "code", "execution_count": 49, "id": "f35be9dd", "metadata": { "ExecuteTime": { "end_time": "2024-02-23T14:32:48.724955Z", "start_time": "2024-02-23T14:32:48.718344Z" } }, "outputs": [], "source": [ "tfs = '''\n", "// this is from Deprotect.cpp:\n", "remove_Boc\\t[C;R0][C;R0]([C;R0])([O;R0][C;R0](=[O;R0])[NX3;H0,H1:1])C>>[N:1]\n", "// this should go last, because later transformations will\n", "// lose the alkali metal\n", "disconnect_alkali_metals\\t[Li,Na,K,Rb:1]-[A:2]>>([*+:1].[*-:2])\n", "'''\n", "\n", "mols = [Chem.MolFromSmiles(smi) for smi in \n", "        ('CC(C)(C)OC(=O)NCCC(=O)O[Na]','CCCO[Na]','CC(C)(C)OC(=O)NCCC','c1ccccc1')]\n", "tmols,tapplied = normalize_molecules(mols,tfs)" ] }, { "cell_type": "code", "execution_count": 53, "id": "7026d378", 
"metadata": { "ExecuteTime": { "end_time": "2024-02-23T14:33:30.949806Z", "start_time": "2024-02-23T14:33:30.937640Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Draw.MolsToGridImage(tmols,legends=['\\n'.join(x) for x in tapplied])" ] }, { "cell_type": "code", "execution_count": null, "id": "17aa6dd3", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }