{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "A [question asked on Mastodon](https://aus.social/@polymerreaction/109543412170217264) made me realize that we don't have a tutorial anywhere on descriptor calculation. Here's a first pass at doing that. This will eventually end up in the RDKit documentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start by doing the usual imports:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2022-12-20T05:18:53.601332Z", "start_time": "2022-12-20T05:18:53.477333Z" } }, "outputs": [ { "data": { "text/plain": [ "'2022.09.1'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from rdkit import Chem\n", "import rdkit\n", "rdkit.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A test molecule:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2022-12-20T05:18:54.571085Z", "start_time": "2022-12-20T05:18:54.561585Z" } }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAABmJLR0QA/wD/AP+gvaeTAAAgAElEQVR4nO3dZ1iTVxsH8H8SwpSNCkJFhouCG1QUq7U4ioJaRWvdAyyvWutC5NVqVRCqVbRKBVelFSvuWnEg4EBbLYoCzjJcDGUHZGSc90N8rbVqgTwhjPt38QFDcp87vdI75zzPGTzGGAghhNQWX9UJEEJIw0ZllBBCFEJllBBCFEJllBBCFEJllBBCFKKm6gRIkyZmLKOiQv67iZqakVCo2nwIqQUqo0SVcqqq/NLTBxoYAHDS06MyShoiKqNExaw1NX3MzVWdBSG1R2WUqNjt589XZGYCmNWqlam6uqrTIaTGqIwSFbPR0pprYQFAXyBQdS6E1AaVUaJi6jyekRp9DkkDRhOeiCoJAC0+fQhJw8ajrUlIfSADzhYW5lZVTWjZUtW5EFIzNJgi9cKDigr/9HQ1Pn+IkZEJTXsiDQqNp0i9YKWp2d/QsEomi8jNVXUuhNQMDepJfXGvvPyzW7c0+PxjDg4K3nQqkUiuikTy3ztoa5traHCRICFvRr1RUl+009LqZ2BQIZP9qHCHNLOycv+zZ89lsucymYQ6CkTJqIySesTLzIwHRD19WiiRKBjKUkNjuLHxcGNjS01NTnIj5G2ojJJ6pL22dh99/XKZLFLhDmlCScncP/+c++efxQpXZELejcooqV9mmJkB+PnZs5Kalz8pY7/m5y9KS2NAHz29Tba2m2xt9WluP1EyKqOkfrHX0XHV1DSOjd0YHFz9V0kYO5aX90lq6leZmXFFRTdKS5WXISGvoTv1pN65fPmys7Ozvr5+ZmamgYHBu58sYexUQcH27OxHlZUAWmloTDE1tdfWTigpmWpqWif5kqaOyiipjwYOHBgbG/v1118vW7bsbc8RM3a6oCA8O/txZSUAa03NyaamQ4yMBDxeHWZKCJVRUi/Fx8cPGDDAwMAgMzNTX1//tb+Wl5eHhYXd6t49UUsLgK2W1nQzs4GGhnSJiqgEffBIfdS/f/9+/foVFRVt3br11cfLyspCQkJsbW3nzZt3ac2aDtra39jYRNrZuVINJapDvVFST8XExLi6uhobG2dkZOjq6paWlu7YsSMoKCg7OxtA586d/f39R48ezaMhPFE1KqOk/nJxcbl48eKqVat0dXUDAwNzc3MB9O7d28/Pb9iwYVRAST1BZZTUXwcPHhw9ejSfz5fJZABcXFyWLVvm6uqq6rwI+RuamUzqo/z8/M2bN4eEhABQU1MzNzcPCQkZPny4qvMi5A2oN0rql6dPn65fv37r1q2lpaUAunbtev36dVNT0/T0dC0tLVVnR8gb0O1NUl88e/ZsyZIlVlZWwcHBpaWlH3300eXLl69du+bo6JiTk7N9+3ZVJ0jIm1FvlKjeU7F4z+HDy6dMKS8v5/F4w4cPX7ZsWY8ePeR/PXLkyMiRI83MzNLT0zVpuyZS/1BvlKhSTlXVukePRqakXGjbtkoiGTZs2JUrV44ePfqyhgLw8PDo3Llzdnb27t27VZcpIW9FvVGiGlmVlbtyco7n54sZ4wMfGRpOEgg6WFq+8clRUVGenp6tW7e+f/++urp6HadKyLtRGSVKd0UkAmMATIRCay2trKqqvbm5h549q2KMD3xoaOjdqpXVO0frjLHOnTsnJyeHh4fPmDGjrhInpFqojBKlG3rz5iRTUwCWmpqnCwqiCwqkjAl4vKFGRtNMTVtX73JnZGTk+PHjra2t7969q0ZbiJL6hK6NEqXTV1P7tEWLT1u0cNbTK5fJeMDHxsZRdnYr2rSpZg0F4Onp2aFDh/T09B9//FGp2RJSU9QbJUr3QVJSl2bNAEw1NW0hFILHa1Wr65sRERGTJk2ysbG5c+cOdUhJ/UFllCjduFu39tnZKR5HKpXa2dndu3cvIiJiwoQJigckhBM0qCcNhkAg8PX1BbBmzRr5KntC6gMqo0TpajeEf6O
JEydaWVnduXMnKiqKq5iEKIgG9aSBCQsL8/b2trOzS05O5vOpH0BUj8ooaWDEYnG7du0yMzMPHDjwySefqDodQmhQTxoaoVC4aNEiAKtWraJOAKkPqDdKGp7KykpbW9vHjx8fOXLEw8ND1emQpo56o6Th0dDQkHdIV65cSf0AonLUGyUNUkVFhY2NTVZW1vLly11dXQ3/j7Z2JnWPyihpqCZMmHDmzJmnT5++9rimpqahoWGrVq3MzMwM387IyIh2LyWcoDJKGqSsrCx7e/vCwsIePXpoamoWFBQUFhYWFhZWVFRUP4hAILCysoqKiurSpYvyUiWNHpVR0iCNHDnyyJEjbm5ux48ff/Xx8vLywmqT19xevXpdvnxZRe+DNAZURknDs3v37qlTpxoYGCQnJ6empvbp06dZs2a1iHPq1Ck3Nzc+n3/37l0rKyvO8yRNBN2pJw1MVlbW/PnzAWzevLm4uFh+xEhJSUlN45SXl7u4uIwfP14sFgcHByshU9JUUBklDczMmTMLCwuHDx8+bty4qVOnVlZWDh48WE9Pr0ZBDhw4YG1tHRISsmzZMoFAsGPHjgcPHigpYdLoURklDcnOnTtPnDhhYGAQGhq6du3aq1evtmnTJigoqKZxDAwMcnJy1q9fb2pq6unpKRaL161bp4yESVNA10ZJg/HkyRMHB4fCwsKffvqpS5cu3bt3r6ysPHny5KBBg2oRzcXF5eLFi8HBwcOGDbO3txcKhWlpaebm5pynTRo96o2SBkM+nHd3d/f09Jw8eXJFRYWPj0/taigAf39/AOvXr7e0tBw1alRlZSV1SEntUG+UNAzh4eFeXl7GxsYpKSnbtm1bsWKFlZXVjRs3dHV1ax2zZ8+eV65c+fbbbwcNGtSpUyd1dfX09HQzMzMO0yZNAfVGSQPw5MmTxYsXA9iyZUtubm5AQACfz9+1a5ciNRTAf//7XwDBwcHW1tbu7u4VFRUbNmzgJmPSpDBC6jeZTDZkyBAAHh4eYrG4e/fuAObOnctJ8B49egDYvHnztWvXeDyejo5Obm4uJ5FJ00FllNR333//PQATE5OcnJzly5cDsLKyEolEnAQ/dOgQAAsLi4qKimHDhgHw8/PjJDJpOujaKKnXHjx44ODgIBKJ9u/f37ZtWycnJ6lUGhsb+8EHH3ASnzHWtWvXGzduhIaGOjo6Ojo66ujoZGRkmJiYcBKfNAV0bZQowfffw80N7u6YNQsiUa3DMMa8vLxEItGIESM8PDwmT54sFovnzp3LVQ0FwOPx5LfsAwMDHRwcBg0aVFpaunnzZq7ikyZBxb1h0vicPctGjGBiMWOMhYez//yn1pG2bNkCwMTEJDc3V347yNramqvh/EsymczBwQFAeHj4pUuXAOjr6xcUFHDbCmnEaFBPuLZgAQYMwLBhACCTwc4Ohw9j+3YYGr75Ryh8Y5jMzMxOnTqJRKKoqCgbG5uePXtKpdL4+HgXFxfOU46MjBw/fry1tfXdu3eHDBly9uzZlStXyq/DEvKvqIwSrs2aBU9PfPjhi3+2a4fAQIwe/cbnFnTvbnn37mu7KRsaGhoYGBw4cODGjRvjxo3bvXu3o6NjcnLyggULlDRDXiqV2tvb37lzZ9euXW3btu3bt6+BgUFGRoaBgYEymiONDJVRwrUtWyASYckSALhzB3PmYMsW/PILCgv/+fOwfXvLixffGMbCwqK4uDg9Pb2kpOTjjz9mjCUlJSnvjJCIiIhJkybZ2NjcuXNn4MCB58+fDwgI8PPzU1JzpDGhMkq4Vl4ONzf064eWLfHjj9i4EY6O73i6SCT654bKT58+DQ0NFYlEJ0+eHDx4cHl5+ZMnT2xtbZWXtVQqtbOzu3fvXkREhKmpqaurq7GxcUZGhoIz/EmToNIrs6TRSUxk+flMLGYXL7LoaJafX+tIa9euBdC7d28Os3u3nTt3AnB0dGSMOTk5AXB3d6+z1knDRROeCHeKiuDhAXt7ZGSgTx8MGQI
jo1oHmz17dvPmzS9fvhwbG8thju8wYcKEwMDA6OhoAObm5gKBIC8vr26aJg0alVHCnS+/xOPHaNMG1taKB9PR0fniiy8ArFq1SvFo1SEUCpcsWWJsbPzkyZO4uDipVDpmzBgO4xcXFx87diwhIYHDmKReUHV3mDQWx48zgGlqstRUrkIWFxcbGRkBOHfuHFcx/5VMJhs8eDAADw8PbiPLV7WOHTuW27BE5ag3SrhQXIxZswAgIAB2dlxF1dPTmzt3LuqwQwpg27Ztp06dMjEx2bZtG7eRBwwYACAuLo7Rfd3Ghe7UEy5MmoSICPTujQsXIBBwGLi4uLhNmzZFRUUXLlzo27cvh5Hf6OWc//3793M7opezsLB48uTJrVu3OnbsyHlwoirUGyUKO34cERHQ1sYPP3BbQwHo6+vPnj0bQEBAALeR/4kx5u3tLRKJxo4dq4waCkC+G0BcXJwyghNVoTJKFFJQUJASGAgAAQFo21YZTXz55Ze6urrR0dFXrlxRRvyXtm7devr0aRMTk02bNimpif79+wOIj49XUnyiElRGiUK++OKLHomJe6dNw5w5SmrCyMjoP//5D4DVq1crqQkAmZnYvt2lQ4duoaGhLVq0UFIr8suj8fHxdDGtMaFro6T2fvnlF3d3d21t7aSkpLbK6YrK5efnt2nTprS09OrVq/L96rklk2HgQMTH47PPJD/+qMZ5/FdZWlo+fPgwOTnZ3t5eqQ2ROkO9UVJL+fn5Xl5eAIKCgpRaQwEYGxt7e3tDaVdIt2xBfDyaN8e33yq3hoIujzZG1BsltTR+/PjIyMg+ffqcP3+ez1f693Fubq61tXV5eXlSUlKnTp04jJyRgU6dUFqKgwcxahSHgd8sMjJp06YWtrbGEREaSm+M1Akqo03RoUOHsrOznZycXm5PV9M6eOzYMQ8PDx0dnaSkJKXuGPKqefPmhYSEeHp6/vzzz1zFlMnw4Yc4dw4TJiAigquo75KRAWtrGBnh2TMo/9uH1AUqo02Oq6trTEzMaw9qamoaVk/z5s2Li4vt7e1zc3O3bNni4+NTZ5lnZ2fb2NhUVlbeuHGDqwuLGzfiyy9hZoaUFEU2AKgZKytkZiIpCZ0711GLRLlUtn6KqEJiYqKamhoAS0tLR0dHW1tbY2PjmnZF5Zt+9uvXTyaT1XH+8lv248eP5yRaWhpr1owB7NAhTuJV15QpDGAbN9Zpo0R5qDfahFRVVfXo0SM5OfmLL77YuHHjq38qLy//56afb5SXlycWi4VC4aJFi9asWVPHb+HRo0e2trZSqTQ1NbV9+/YKRsvIwOTJaNMGe/Zwkl11/fADpkyBhweOHKnTdomSUBltQvz9/QMCAtq1ayffRn748OGJiYnVGcgbGxtraPx1PyQqKsrT07Nly5YZGRnK247+jUpLS83NzUtKSgYNGnTq1KlaRNi3D3l5mD0bAIKDMXkydHWhrc1xnu/26BFat4aBAfLyOF/2RVRA6dM7SD1x/fr1b775hs/nb9++XV77cnJysrOzs7Ozq/NyHR0deUmNjY0dM2aMo6Pj1atXw8PD5VuH1BlfX9+SkhIAZ86c0dPTa968+T+Lvrn5EA2Nzq+dm/dSSgp++AH9+8PeHpcuYexYtGxZl+8AAN57D9bWSE/HjRvo1q2uWmUM69fj3Dkwhk6dsGIF1NXrqu1GjnqjTUJlZWWPHj1SUlIWLlz4zTffyB+s/kC+oKCgsrLy5as0NTWPHj06YsQIMzOztLS0OuuQxsXFDRw4UCgUjho16pdffikrK3vj0/r3j4mPH/jag/Ji2qsXrKxgaorDhxETg5EjERICS0vlp/4PM2Zgxw6sW4cFC+qqyb17ERuL8HDweFi+HDo68PWtq7YbOeqNNgkrVqxISUlp3779119//fJBLS0tLS2tVq1aVSdCWVmZvKRqamoCcHd37969e2Ji4q5du+rmZn1ZWdnMmTMZY8uXL58+ffrp06fLysqCgoIGDhz4WtE
XCo2bN3/jAXpo3RpWVmjbFj17YteuOsj6rQYMwIEDeP68Dpv85RcsXAgeDwBmz8aoUVRGOaPaO1ykDiQmJgqFQj6ff+HCBQ7DHjhwAMB7771XWVnJYdi3ka9i6tq1a1VV1ahRowC4urpWf6pAQQFLS2P37zN/f3byJCsrY05ObMAAlpmp1Kzf6uxZdvXqi9+PHmVisXKakUrZd9+xAwcYY2zYMHb79ovHnz9nnTopp8mmiMpoI1dRUfH+++8DWLx4MbeRZTKZg4MDgLCwMG4j/9PZs2d5PJ6GhsbNmzf37NkDQF9f/+HDh7UIJS+jjLEjRxiPp7Iy6u3N2rVjpaWMMdanz4tfOHbjBuvViwGsZUsmEjE/PxYR8eJP8fFswgQlNNlEURlt5BYtWgSgQ4cOz58/5zx4ZGQkAEtLS6V2SIuLi1u3bg0gMDAwKytLfqzI7t27axft3DmWnv7i9++/ZxkZXKVZM97ezNeXLVzImDLK6PPn7KuvmLo6A5iZGfvhByaVspwc1rs3W7+ebdnCevVid+5w2mSTRmW0Mbt8+bJAIBAIBL///rsy4kulUvlqop07dyojvlzBwoUTO3Z0cnISi8UjRowA4ObmpnjY/Hzm6srMzVlFheLBaszbmyUmMldXduPGizLK2VKG+HjWvj0DGJ/PvLxYcTFLSGAODiw2lpWVsTNnWHQ0KyzkqDHCGJVR5SkoYOfOsZs32aNHyhmy/ZuKigo7OzsAfn5+ymslIiICgI2NjVhJl/dOnWI8HtPRyb19e9euXQAMDAwePXqkeGCZjHXvzgC2daviwWpMXkZTU9lHHzFnZ3b3LrOwYL6+f/WUayM/n3l5MR6PAaxTJ3b5MisqYp9/zvh8BrDhwznLnvwdlVHO/PorKyhgjDGJhJ09y06cYMBfP+rqrGVL1qEDc3Zmbm5s4sRJc+bMWb58+YYNG3bv3n306NHz588nJyc/fvy4qqqKk3wWLFgAoGPHjuXl5ZwEfCOJRCJfTbRnzx7uoxcXs9atGcCCgtiTJ2Xdu0/s2DHi5QU+hUVFMYC99x6rk5tkfyMvo4wxX1+mpcWCgl58Tvh85uZW45tOMpls586dCUOHMoBpa7OgICYWs2PHmIUFA5hQyHx9mTI/Bk0clVHOtGvHZs5kjLGyMubszOLiWL9+zMGBWVgwHZ2/ldRmzWTvmDsh3xpdQZcuXRIIBGpqaleuXFE82rvJO4lt27aVSCQch542jQGsZ08mkbCPP2aAbORIDsPLZMzBgQFM+TfJ/pKXx0QitmEDu3ePMcZKS5m7OysvZ3/8wby8mLb2iw+JmVl1O6d37tyRn01ioKkp+vRTlpbGnjxhn3zyIpCzM0tJUfabauKojHLG0ZGNH88uXnxRRl/r4FRUsOxsdusWS0hgJ05URkREbNq0aeXKlfPmzZs8ebK7u7uLi4u9vb25ufnNmzcVzKS8vFw+nPf391cwVHVIJBL5ts2RkZFcxj19mvF4TEODpaSwHTsYwAwM2OPHXDbBWGQkA5ilZR11SCsrWd++rHNn9rZZBs+esfXrX1zbBJhAwD7//ObRo0ff+BVVVVW1du1a+Uzeli1b/vDDDxKJRLZhw4sNVwwM2PffM6lUuW+JUBnlkKMje/SI9erFioqYszNbtYrxeMzIiNnYsB49mKsr8/Rk3t5syRK2aVNxWFhYVFRUTEzMtWvXMjIyioqKOMxk3rx5AOzs7JQ6nH9VWFiYvEUpV//TFhWx995jAPvmG/b4MTM0ZAD76Sdugr9CKmX29gxgyrxJ9peZMxnAWrX696+Dl53T9u3HAjAzM/P19U1/pXd64cIF+Zclj8ebOHHis2fPkpKSnJyc7vXtywA2bBjj4goyqQ4qo5xxdGSMseBgtmYNc3ZmS5f+bSD/6k+XLoX/HMvz+XxjY+OWLVvq6ur2799/7Nixs2bN8vPzCw4ODg8PP3DgwNmzZ6tTc18O56++nN6tfFVVVW3atAEQFRXFTUT5XnK
9ejGJhMkv+bm7cxP5HyIiGMBsbJQ2B/7/goMZwLS0WPUvtDx7Jl63bt3LvawEAoGbm9vevXvnzJkj397Q1tY2JiamtLR0/vz58i0QB3TqJDt+XJnvg7yO1tRzxskJV65ALEa/fuDzkZAAxt68JJGxrPT0r15bwlhcXAyAz+fLZO+6cirH5/PfuBVTs2bNQkNDHz9+vHz58pUrVyr/Tf8lNDTUx8fn/fffv3nzpqJniiQno3NnaGri+nWcPw8vLxgbIyUFpqYcJfs3Uik6dsTTpyw8/N6YMYpuvvc2J09i2DDIZIiMxNixNXstY+zcuXNhYWGHDh16ubmBQCCYP3/+qlWrzp496+Pj8+DBAzU1NR8fn9WrV+vq6nL/BsjbURnlzKRJL7atPH8eu3fD0BDXr+O1fYbkP8bGBfr6+fLCJ/j/RmkymaywsDAtLe327dumpqZFRUXv2CtEXnPfqHXr1lVVVQ8ePFBXV79w4YKenl7nOtljvaqqqm3btg8fPjxy5IiHh4ei4c6cwaNHGDwY9vYoKkJkJMaN4yLNN9u//+7Mma5mZtqpqakCJWxdd+sWnJ1RXIzVq+HvX/s4RUVF+/fv37x5c0pKirq6uqurq5GRkXzOWZcuXcLCwhwdHTlLmlSfinvDjVe/fm8d1Pfvf/Xlf389PT1LS8suXboMGDBg1KhR//3vf6sTXCqV5uXl3b9//8qVK6dOndq3b19oaGhAQICXl5dAIBAKhenp6eHh4QAGDx6s7Hf6UkhICICuXbsqtCu+SPTXhcOQEMbjsVGjOEnvHV7eJdu7dy/nwfPymI0NA9iYMdzMsZfJZIGBgQDkFb9Zs2YbNmzgfpoEqTYqo8py+zaLjWUHD7Lt29k337ClS9nnn7Nx49jgwWzSpLi2bduamJj8s+PTs2dPBdudPHkyAC8vr/z8fPngLiEhgZN39DZSqTQrK4sxVl5e3qJFCwArVqyoTSCZjPn4sOHDmY8Pc3ZmycmMMXbyJMvN5TTfN5N/63Ts2JGzu2SMMcYqKipGj57+3ntpPXtyOXFz79698g+Mm5tbpqr2BSD/R2VUxYqLizMzM69fvx4bG3vw4MHo6GgFA96/f19NTU0oFGZkZPj5+QEYNmwYJ6m+zYYNG/T19fft28cYk2+7N23atNoEOnaMzZjx4vebN9mHH3KX47/j/i4ZY4yxKVOmAHBw6JqVxeW5VVu2bAFHi2KJ4qiMNkKfffYZAB8fn7y8PHmHVHmT8NPS0nR0dAAcPnxYPkmAx+Pt2rWrNrH8/Nj+/X/9s1077taZV8vWrVsBvP/++1x1SIOCggBoaWlx/t9/9erVqKt5weRf0TnZjZC/vz+fz9+xY0d5efmsWbMABAQEKKMhmUw2ZcqUsrKyKVOmDB06dMaMGVKp1M/PT94FqzGBAK/OUqjzm5/Tp0+3tLRMTU09duyY4tGio6OXLl0q/1Lh/M5PYWEhAMNXT0chKqTqOk6UwtPTE8DcuXNzcnK0tbV5PN4ff/zBeSvy80hatWpVUFAwf/58KLiE/8wZ9vLk5N9/Z6oYsW7atAlAhw4dkpOTRSJRreOkpqbq6+sDWLNmDYfpvTRt2jQAO3bsUEZwUlM04alxSk1N7dSpk7q6elpa2rp16zZs2DB69OioqCgOm7h7927Xrl3Ly8t//fVXQ0NDFxcXHo936dIlhXpe/v64ehUtWyI7G9u2wcaGu3yrpaKiwsLCori4WCKRyB/R1NSszuGphoaGpqam8gmz+fn5PXv2TEtLGzNmzM8//8yTn9vBqVGjRh0+fPjQoUMjR47kPDipMVXXcaIs8pM2FixYkJ2draWlxePxFF+t/5JUKu3bty+AadOmlZWVtWvXDsCyZcs4CC0WM06XxtbUwYMHmzdvbmJi0qxZsxr9ryQUClu0aNG+fXv5fbZevXopbzGufC+SuLg4JcUnNUK90UYrKSmpW7duWlpaGRkZq1at+u6778aNGyffr15xwcHBvr6+5ub
mycnJK1euDAkJsbOzu3bt2qvH2TcO1T8/NTc3V74CzcDAQCQSHTp0yN3d/cKFC4sWLTpw4ICFhQWHWX322XePHiV/951vp07WHIYltaTqOk6UyN3dHYCvr++jR480NDT4fH4KF3um3b59W36o8okTJxISEup+CX/9VFVVlZube+fOHfnpe+3bt6+oqJCPCUaMGMFtW/JdWB884DYqqSUqo41ZYmIij8fT0dF5+vSp/Jb9xIkTFYwpkUicnJwAzJw5s6ysTL7456uvvuIi30aioqKiY8eOAFatWpWVlWVgYADg8OHDHDahq8sAVlLCYUhSe1RGG7mhQ4cC8Pf3l6+yFwgEd+/eVSSgfO6Uubl5YWHhnDlzAHTu3LluzlhuQOLj4+VHmd6+fXvz5s0AzMzMuNoOUSxmPB5TU6vjabXkraiMNnKXL18GoKenV1BQMGPGDABTp06tdbRbt25pamryeLzo6OiLFy/y+Xw1NTVlTKVqBKZOnQrggw8+kEgkzs7OAObMmcNJ5KdPGcBMTDgJRjhAZbTxc3V1lY+709LS1NTUBALBPfn5FTUkFovlk5m8vb1LS0ttbW0BrFy5kvOEG4f8/Hz5DgO7d+++efOmUCjk8/mXLl1SPPLduwxg7dopHolwg8po45eQkABAX1+/sLBQvr5opvzQqBqSL0C0tLQsKSnx8fEB0KVLF64O4GuU5FvYGRsb5+bmLlmypHlze0/PNMU3h/7ttxcnVJF6gspokzBgwAD5HY979+4JBIKPP/64psvGU1NT5cP5U6dOxcbG8ng8dXX1GzduKCnhRsPNbfQHH8RNny59/vx5164S+SGnCoqOZgAbMoSL/AgXqIw2CbGxsQCMjIxKSkru379fiwizZ88G8Pnnn5eWltrY2ABYvXo153k2Pn/+ybS0GI/HYmJYbCzj8Zi2NktLUyjm3r0MYJ9+ylGKRGFURpsKFxcXAHPnzq3dy2Uy2Y4dO0QikXxSZNeuXWk4X00BAbzxw10AAAUJSURBVAxgtrbs+XM2fjwDmIJbad++zYKC2KFDHOVHFEarmJqKffv2ffrppwDU1dWNjY2rs0jczMzstfXgsbGxH330kbq6+tWrVx0cHFT0VhoYiQSOjkhKgr8/5s1Dx47Iy6v9qSg9eyI0FN26ITMTYWHw9samTVi/HgBu38ZPP2H1am7TJ/9OTdUJkDoybty4PXv2REdHV1VVZWdnZ2dn/+tLNDQ0Xq2qurq6p0+fZowtX76camj1qalh2zb07o2gIIwZg7VrMWMG5s3D4MGoxUZ3T59i4ULExKCiAhkZqKhAZuaLP5WX4+FDTlMn1UNltAk5ceJEZWWlRCIpLi6uziLxnP97GcHS0lIgECxevFiF76IhcnKCtzdCQ+HtjYQERETg4kWcOlWbDqmJCQYORGgoBg588UhBAX77DQDu3uUyZ1J9VEabFg0NDQ0NDR0dHfkuRO9WWVn5alV9+vTpokWL8vPzY2JihgwZUgfZNiZr1+LYMfz+O7ZvR3g4iopQzQ0FCwpw/jzi4hAfj59+AoCFC+HiAnv7F0949gxxcQCQlaWc1Mm/oTJK3kpDQ8PU1NT0ldPh8/LyFi9evHLlSiqjNaWnh5AQTJoEiQRt2/7Lk0tL8dtviIlBTAyuX//rTAB5udTQwOrVWLYM5uYA0L49/PwA4No1bNyovHdA3ooOESE14OPj06JFi99++y0mJkbVuTQ8n3yCtDTk5+PJEwAQibBr119/LStDTAyWLEHfvjAygqsrgoKQmAg+H927w9cXZ85g5swXTx40CK98uxEVozv1pGYCAwOXLl3q7OwsXxxFaqpNG/TsiZ9/RnY2pk+Hry9OnEB8PBITIZW+eI6GBpyc8OGH6N8fvXpBU/Ovl1+5AicnAMjPR1YWrK1x5w66dweAkhKkp6NLlzp/S00elVFSM6WlpVZWVnl5eXFxcfI92EmNODmhWzcMH45u3TB9OoyMXlzxVFND58746CP06YM
PPoCenqoTJdVGg3pSM82aNZs7dy6AVatWqTqXhmr1anz1FZ4/B4Bx4+Dnh1OnUFSEP/7A2rUYPpxqaANDvVFSY8XFxW3atCkqKrr428U+PfuoOp0GxskJV64gPBzXruHBA5w4oeqEiMKoN0pqTF9ff2nI0t7ne39t9LWqc2mopk9HSoqqkyAcoTJKamPmZzNv690+LTqdUEo3mmpGfiIyn48tWzB0qKqzIVygQT2pJf8s/4CcADd9t+M2x1WdCyGqRGWU1FK+JN8q1UokFV3pcMVRu3orcghpjGhQT2rJWM34c5PPAazJXqPqXAhRJeqNktrLk+RZpViVycqudrjaXbu7qtMhRDWoN0pqz0TNxMvEi4EF5gSqOhdCVIZ6o0Qh2eJsm1QbMRNn2GdYCC1UnQ4hKkC9UaIQM6FZWOuwmx1vWggtymXl8gd/LPhRtVkRUpeoN0o4kFCasPjJYjOhWZ4k71uLb70eev3R4Q9VJ0VIHaH9RomiZJB5PfSKto1urd66RFoi5AlVnREhdYrKKFHUk6onJmomrdVbA9AT0KYapMmha6NEUTLIeOD9+/MIaaSojBJFWQgtciQ5eZI8VSdCiGrQoJ4oSsATfGv+7eA/B/dr1i9LnLWw5UJVZ0RInaI79YQbYiZ+UPXARM3EQGDwTPKsuVpzVWdESB2hMkoIIQqha6OEEKIQKqOEEKIQKqOEEKIQKqOEEKIQKqOEEKKQ/wF9XPqhxKWL9AAAAhl6VFh0cmRraXRQS0wgcmRraXQgMjAyMi4wOS4xAAB4nHu/b+09BiDgZYAARiCWBWJ5IG5gZGNIAIkxsztoAGlmZjYIzQLjsztYgGhGuAQHA5hmgmhkYkJozADTjEgMiA50GmYC3CS4PIZGXAKCDAogV7OBTWBiYYfQzGxgYRZOiCwqxQ30OCNTAhNzBhMzSwILawIrmwbQ9QzsHAwcnAmcXBlMXNwJ3DwZTDy8Cbx8Gkw8/Ar8AgoCghlMgkIJQsIZTMIiCSKiGUwiYgxC4griEhrM3JIMklIMktIMkjIMrIwJvBwJogIJIsysjEALWNk4ubh5eDnYBIWERUQFxNMYIQEPBrL1it0Oej/S7UGcLY+nObSF8OwHsTt8Cxw4Ft0Es1W3T3SYIWh3AMTe43DAQbdXDcyOLj7s0NPeC1azVPiNg6TE3H0gtmKLgEP39TMQ8aAz9m/Vm8Dir6dLO+h9qgXbZVn1yD54sagDiL1mbrfdJA4ZMLv3Nf9+wb4ZYDUZf0/bfvwtC9a7zcpm/0SnzWAzvY4wH+iLWApW89jI8kDteitbsDvreg6c+C9iB2IXmS4/IPm4B6yei2PSgZAZMmA3r3VzOTD5rxKYLXqD74BwxTKwmvuvJQ9Eq1SCxVWlrh+YI9EAFs+pZz5oPycNzD5ucXxflH4R2J3nm9v2C2ovAbN3/yuxF3HtA7MPSt/b333FHcwWAwCbbYsdCdJfdgAAAqx6VFh0TU9MIHJka2l0IDIwMjIuMDkuMQAAeJx9VdtuFDEMfd+vyA9s5LvjR3pDCHUrQeEfkHjk/4WdUTupFJhprN3sie3Y57iXVs+3h6+//rT3hx4ul9bgP38R0X4yAFyeW31od4+fv9za/eunu7ed+5cft9fvjaIx5pl8P2I/vb48v+1gu2/SWXxwNOgBhnWow3zOk9RuiXNUCmtX6CoDTTdATofcXcWZ2hW7A5D6BijToyobeLtSD4tBsgFqAi13PYAKaEBsO4+Woa0j5F1maAS1iA3Q20vzzu4coy7DYPvQIz1SJx1gUh4NgGlsgJE5YtcwciqPpEK4qyPCdCmMKHWHCAzfXQarM1k+C1bNE8wDYpcjUgIzopMNn65HFmgHrNZU64xwjPSNgYNth5TpMrlmIHUE8zpjex3NUibAWcmxSiQAYruioxWUOugYahmeAn1LIfRKlPoI9oOMouq7smM16Jr8DeDZaUwkbMsZhdROlInKwY7s+pbqcDhV4yR5JSJiHju
vVE26cgdLV1JQTShtoXRcihgjpZYJpIiVd5XK7t3/nnFBIUNkgBE0ZNd9qlZdrTthyGR9/uy6vVcJ6Tp6dt8gqmsBMDKvDdQOqvCIFFHpeQgK7JDenmZUhZEl0o78D72nbp6K8OqQlJUsqgZvg8fhsjiV2XFHouzEBvl4e/gwzI7xdvdyezjHW710TjHJxeesklrnRKpXz7mTX5qd04Vy+TlDMM+Oc1JgrjjngeTCVfZSBnHRt5RBWoQsZZAXxUoZlEWZVAZ1USBOjC1Cw2l8EZSUwbEIR8pgLAKRMrTqQMoQLnSXMkQLq3Hu8EJenH5koSiXIV2YWBdPmi2EwzLkC7GOnbEQ6NiJhSlSfXnv56yynG3gmTAtpSjKrASp72//O/Pz5S8+mWdGWlRaGAAAAWt6VFh0U01JTEVTIHJka2l0IDIwMjIuMDkuMQAAeJwtkUtqw2AMhK9S6CaGv0LvB6arQuiqOUDpytuSE+TwldyAwfB5NKORP+50XD7ufBz9uly3fq7bcbkdMuD1a5vX73bIdvD7bbt/3z9/Dnq/vTwuCqKRuhAKnYTWrhBkst4QTJPc1i4QptGIIBDZ187AluhDHFFk7QRWzjFjbMpEI1IhsrEuqogRhZfYYhBJrLW3ONgz1mgTee09L1mEtjo1lbQRAVqrDUjO9Da0QOKloGolpyTEmHuImDthfMyZMhcBFeX4cPtkehMumn2aZEk3Q0A1i5jBvkjhHIQa4agMmG26tRWiZyOHYCodhIjnXELHOc0FCjGZTy9z0VwdpHoOCqB70hBTDzmX0O7Hq79l8XNTlqe9IJpOZ6xyVB1/6gu3x5Sc2kEjVESN+X1m4mdAeWVv4b1i1ABHlhhAGCn/faxBgEScztL1WNf2+AO4M3nuv6Sq1gAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doravirine = Chem.MolFromSmiles('Cn1c(n[nH]c1=O)Cn2ccc(c(c2=O)Oc3cc(cc(c3)Cl)C#N)C(F)(F)F')\n", "doravirine" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Descriptors` module has a list of the available descriptors. The list is made of (name, function) 2-tuples:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2022-12-20T05:18:55.586164Z", "start_time": "2022-12-20T05:18:55.574240Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "208\n", "[('MaxEStateIndex', ), ('MinEStateIndex', ), ('MaxAbsEStateIndex', ), ('MinAbsEStateIndex', ), ('qed', )]\n" ] } ], "source": [ "from rdkit.Chem import Descriptors\n", "print(len(Descriptors._descList))\n", "print(Descriptors._descList[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use those functions to directly calculate the corresponding descriptor. 
So, for example, the value of `MaxEStateIndex` for doravirine is:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2022-12-20T05:19:00.014047Z", "start_time": "2022-12-20T05:19:00.001327Z" } }, "outputs": [ { "data": { "text/plain": [ "13.412553309006833" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Descriptors._descList[0][1](doravirine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, if we just want a few named descriptors, it's a lot clearer (and easier to write the code!) if we call the individual descriptor functions directly:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2022-12-20T05:19:01.156995Z", "start_time": "2022-12-20T05:19:01.145963Z" } }, "outputs": [ { "data": { "text/plain": [ "13.412553309006833" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Descriptors.MaxEStateIndex(doravirine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often we want to calculate all the descriptors. 
As of the 2022.09 release of the RDKit there's no real convenience function for descriptor calculation, so let's create one:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2022-12-20T05:19:02.305280Z", "start_time": "2022-12-20T05:19:02.302703Z" } }, "outputs": [], "source": [ "import traceback\n", "\n", "def getMolDescriptors(mol, missingVal=None):\n", "    ''' calculate the full list of descriptors for a molecule\n", "    \n", "    missingVal is used if the descriptor cannot be calculated\n", "    '''\n", "    res = {}\n", "    for nm,fn in Descriptors._descList:\n", "        # some of the descriptor functions can throw errors if they fail, catch those here:\n", "        try:\n", "            val = fn(mol)\n", "        except Exception:\n", "            # print the error message:\n", "            traceback.print_exc()\n", "            # and set the descriptor value to whatever missingVal is\n", "            val = missingVal\n", "        res[nm] = val\n", "    return res" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2022-12-20T05:19:03.356076Z", "start_time": "2022-12-20T05:19:03.319709Z" } }, "outputs": [ { "data": { "text/plain": [ "{'MaxEStateIndex': 13.412553309006833,\n", " 'MinEStateIndex': -4.871620672188628,\n", " 'MaxAbsEStateIndex': 13.412553309006833,\n", " 'MinAbsEStateIndex': 0.045220418860841605,\n", " 'qed': 0.6914051268589834,\n", " 'MolWt': 425.754,\n", " 'HeavyAtomMolWt': 414.66600000000005,\n", " 'ExactMolWt': 425.050251552,\n", " 'NumValenceElectrons': 150,\n", " 'NumRadicalElectrons': 0,\n", " 'MaxPartialCharge': 0.4197525104273902,\n", " 'MinPartialCharge': -0.45079941098947357,\n", " 'MaxAbsPartialCharge': 0.45079941098947357,\n", " 'MinAbsPartialCharge': 0.4197525104273902,\n", " 'FpDensityMorgan1': 1.3103448275862069,\n", " 'FpDensityMorgan2': 2.0344827586206895,\n", " 'FpDensityMorgan3': 2.6206896551724137,\n", " 'BCUT2D_MWHI': 35.495691906445956,\n", " 'BCUT2D_MWLOW': 10.182401353178228,\n", " 'BCUT2D_CHGHI': 2.363442602497932,\n", " 'BCUT2D_CHGLO': 
-2.1532454345808123,\n", " 'BCUT2D_LOGPHI': 2.362094239067197,\n", " 'BCUT2D_LOGPLOW': -2.2620565247489415,\n", " 'BCUT2D_MRHI': 6.30376236817795,\n", " 'BCUT2D_MRLOW': -0.13831572005086737,\n", " 'BalabanJ': 2.1143058157682066,\n", " 'BertzCT': 1236.821427505276,\n", " 'Chi0': 21.344570503761737,\n", " 'Chi0n': 14.619315272563007,\n", " 'Chi0v': 15.375244218581463,\n", " 'Chi1': 13.595574016164479,\n", " 'Chi1n': 7.8933192308003095,\n", " 'Chi1v': 8.271283703809537,\n", " 'Chi2n': 5.882827756329733,\n", " 'Chi2v': 6.319263536801718,\n", " 'Chi3n': 3.9307609940961763,\n", " 'Chi3v': 4.148978884332168,\n", " 'Chi4n': 2.4772835642835087,\n", " 'Chi4v': 2.7023697348309867,\n", " 'HallKierAlpha': -3.519999999999999,\n", " 'Ipc': 2291995.915536308,\n", " 'Kappa1': 20.220355828454835,\n", " 'Kappa2': 7.4789147435283585,\n", " 'Kappa3': 4.168020338062062,\n", " 'LabuteASA': 164.8909024413842,\n", " 'PEOE_VSA1': 9.303962601591405,\n", " 'PEOE_VSA10': 11.3129633249809,\n", " 'PEOE_VSA11': 5.824404497999927,\n", " 'PEOE_VSA12': 5.749511833283905,\n", " 'PEOE_VSA13': 5.559266895052007,\n", " 'PEOE_VSA14': 11.86604191564695,\n", " 'PEOE_VSA2': 9.361636831863176,\n", " 'PEOE_VSA3': 9.893218992372859,\n", " 'PEOE_VSA4': 23.531818506063985,\n", " 'PEOE_VSA5': 0.0,\n", " 'PEOE_VSA6': 11.600939890232516,\n", " 'PEOE_VSA7': 24.26546827384644,\n", " 'PEOE_VSA8': 18.267148868031594,\n", " 'PEOE_VSA9': 18.177429210401844,\n", " 'SMR_VSA1': 17.908108096824506,\n", " 'SMR_VSA10': 11.600939890232516,\n", " 'SMR_VSA2': 5.261891554738487,\n", " 'SMR_VSA3': 19.331562912184786,\n", " 'SMR_VSA4': 7.04767198267719,\n", " 'SMR_VSA5': 12.72105492335605,\n", " 'SMR_VSA6': 0.0,\n", " 'SMR_VSA7': 73.27433730199388,\n", " 'SMR_VSA8': 0.0,\n", " 'SMR_VSA9': 17.568244979360085,\n", " 'SlogP_VSA1': 15.98587324705553,\n", " 'SlogP_VSA10': 13.171245143024459,\n", " 'SlogP_VSA11': 11.49902366656781,\n", " 'SlogP_VSA12': 11.600939890232516,\n", " 'SlogP_VSA2': 19.331562912184786,\n", " 'SlogP_VSA3': 
19.76872690603324,\n", " 'SlogP_VSA4': 11.33111286753076,\n", " 'SlogP_VSA5': 16.95130748139392,\n", " 'SlogP_VSA6': 40.05138621360316,\n", " 'SlogP_VSA7': 5.022633313741326,\n", " 'SlogP_VSA8': 0.0,\n", " 'SlogP_VSA9': 0.0,\n", " 'TPSA': 105.70000000000002,\n", " 'EState_VSA1': 28.738272135679853,\n", " 'EState_VSA10': 22.760319511168106,\n", " 'EState_VSA11': 0.0,\n", " 'EState_VSA2': 28.704757542634727,\n", " 'EState_VSA3': 6.06636706846161,\n", " 'EState_VSA4': 21.397409935657397,\n", " 'EState_VSA5': 19.18040611960041,\n", " 'EState_VSA6': 6.069221312792274,\n", " 'EState_VSA7': 0.0,\n", " 'EState_VSA8': 10.197363616602075,\n", " 'EState_VSA9': 21.599694398771053,\n", " 'VSA_EState1': 47.48050639865553,\n", " 'VSA_EState10': 5.842061004535676,\n", " 'VSA_EState2': 24.16343117595945,\n", " 'VSA_EState3': 14.921853617262808,\n", " 'VSA_EState4': -2.8980189732872814,\n", " 'VSA_EState5': -1.0781549918202147,\n", " 'VSA_EState6': 6.092225491490601,\n", " 'VSA_EState7': -3.945179835565914,\n", " 'VSA_EState8': -0.2762282865821226,\n", " 'VSA_EState9': 1.3919488437959202,\n", " 'FractionCSP3': 0.17647058823529413,\n", " 'HeavyAtomCount': 29,\n", " 'NHOHCount': 1,\n", " 'NOCount': 8,\n", " 'NumAliphaticCarbocycles': 0,\n", " 'NumAliphaticHeterocycles': 0,\n", " 'NumAliphaticRings': 0,\n", " 'NumAromaticCarbocycles': 1,\n", " 'NumAromaticHeterocycles': 2,\n", " 'NumAromaticRings': 3,\n", " 'NumHAcceptors': 7,\n", " 'NumHDonors': 1,\n", " 'NumHeteroatoms': 12,\n", " 'NumRotatableBonds': 4,\n", " 'NumSaturatedCarbocycles': 0,\n", " 'NumSaturatedHeterocycles': 0,\n", " 'NumSaturatedRings': 0,\n", " 'RingCount': 3,\n", " 'MolLogP': 2.65458,\n", " 'MolMR': 94.87570000000002,\n", " 'fr_Al_COO': 0,\n", " 'fr_Al_OH': 0,\n", " 'fr_Al_OH_noTert': 0,\n", " 'fr_ArN': 0,\n", " 'fr_Ar_COO': 0,\n", " 'fr_Ar_N': 4,\n", " 'fr_Ar_NH': 1,\n", " 'fr_Ar_OH': 0,\n", " 'fr_COO': 0,\n", " 'fr_COO2': 0,\n", " 'fr_C_O': 0,\n", " 'fr_C_O_noCOO': 0,\n", " 'fr_C_S': 0,\n", " 'fr_HOCCN': 0,\n", " 
'fr_Imine': 0,\n", " 'fr_NH0': 4,\n", " 'fr_NH1': 1,\n", " 'fr_NH2': 0,\n", " 'fr_N_O': 0,\n", " 'fr_Ndealkylation1': 0,\n", " 'fr_Ndealkylation2': 0,\n", " 'fr_Nhpyrrole': 1,\n", " 'fr_SH': 0,\n", " 'fr_aldehyde': 0,\n", " 'fr_alkyl_carbamate': 0,\n", " 'fr_alkyl_halide': 3,\n", " 'fr_allylic_oxid': 0,\n", " 'fr_amide': 0,\n", " 'fr_amidine': 0,\n", " 'fr_aniline': 0,\n", " 'fr_aryl_methyl': 0,\n", " 'fr_azide': 0,\n", " 'fr_azo': 0,\n", " 'fr_barbitur': 0,\n", " 'fr_benzene': 1,\n", " 'fr_benzodiazepine': 0,\n", " 'fr_bicyclic': 0,\n", " 'fr_diazo': 0,\n", " 'fr_dihydropyridine': 0,\n", " 'fr_epoxide': 0,\n", " 'fr_ester': 0,\n", " 'fr_ether': 1,\n", " 'fr_furan': 0,\n", " 'fr_guanido': 0,\n", " 'fr_halogen': 4,\n", " 'fr_hdrzine': 0,\n", " 'fr_hdrzone': 0,\n", " 'fr_imidazole': 0,\n", " 'fr_imide': 0,\n", " 'fr_isocyan': 0,\n", " 'fr_isothiocyan': 0,\n", " 'fr_ketone': 0,\n", " 'fr_ketone_Topliss': 0,\n", " 'fr_lactam': 0,\n", " 'fr_lactone': 0,\n", " 'fr_methoxy': 0,\n", " 'fr_morpholine': 0,\n", " 'fr_nitrile': 1,\n", " 'fr_nitro': 0,\n", " 'fr_nitro_arom': 0,\n", " 'fr_nitro_arom_nonortho': 0,\n", " 'fr_nitroso': 0,\n", " 'fr_oxazole': 0,\n", " 'fr_oxime': 0,\n", " 'fr_para_hydroxylation': 0,\n", " 'fr_phenol': 0,\n", " 'fr_phenol_noOrthoHbond': 0,\n", " 'fr_phos_acid': 0,\n", " 'fr_phos_ester': 0,\n", " 'fr_piperdine': 0,\n", " 'fr_piperzine': 0,\n", " 'fr_priamide': 0,\n", " 'fr_prisulfonamd': 0,\n", " 'fr_pyridine': 1,\n", " 'fr_quatN': 0,\n", " 'fr_sulfide': 0,\n", " 'fr_sulfonamd': 0,\n", " 'fr_sulfone': 0,\n", " 'fr_term_acetylene': 0,\n", " 'fr_tetrazole': 0,\n", " 'fr_thiazole': 0,\n", " 'fr_thiocyan': 0,\n", " 'fr_thiophene': 0,\n", " 'fr_unbrch_alkane': 0,\n", " 'fr_urea': 0}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "getMolDescriptors(doravirine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose I want to generate the full set of descriptors for a bunch of molecules..." 
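, "

One thing worth planning for before running this over a whole file: RDKit suppliers yield `None` for entries they cannot parse, so a batch run usually filters those out first. Below is a minimal sketch of that pattern. To keep it runnable without RDKit it uses stand-ins that are NOT part of RDKit: `calc_all` mirrors the helper above, `toy_desc_list` plays the role of `Descriptors._descList`, plain strings play the role of molecules, and the `silent` flag is an addition made for this sketch.

```python
import traceback

def calc_all(obj, desc_list, missing_val=None, silent=False):
    '''apply every (name, function) pair in desc_list to obj'''
    res = {}
    for name, fn in desc_list:
        try:
            val = fn(obj)
        except Exception:
            # report the failure unless asked to stay quiet, then fall back
            if not silent:
                traceback.print_exc()
            val = missing_val
        res[name] = val
    return res

# Hypothetical stand-ins, chosen only so the sketch runs without RDKit
toy_desc_list = [
    ('length', len),
    ('first_char_code', lambda s: ord(s[0])),  # raises on an empty string
]
toy_mols = ['CCO', None, '', 'c1ccccc1']  # None mimics an entry that failed to parse

# Skip failed parses, but remember which input each row came from
kept = [(i, m) for i, m in enumerate(toy_mols) if m is not None]
indices = [i for i, _ in kept]
rows = [calc_all(m, toy_desc_list, silent=True) for _, m in kept]

print(indices)
print(rows)
```

Keeping `indices` next to `rows` makes it possible to line the descriptor rows up with the activity data later, even when some inputs were dropped.
"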
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2022-12-20T05:19:07.446239Z", "start_time": "2022-12-20T05:19:07.335355Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "canonical_smiles molregno activity_id standard_value standard_units\r\n", "N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)cc2F)C(=O)N3CC[C@H](F)C3 29272 671631 49000 nM\r\n", "N[C@@H](C1CCCCC1)C(=O)N2CCSC2 29758 674222 28000 nM\r\n", "N[C@@H]([C@@H]1CC[C@H](CC1)NC(=O)c2ccc(F)c(F)c2)C(=O)N3CCSC3 29449 675583 5900 nM\r\n", "N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)cc2F)C(=O)N3CCCC3 29244 675588 35000 nM\r\n", "N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(OC(F)(F)F)cc2)C(=O)N3CC[C@@H](F)C3 29265 679299 6000 nM\r\n", "N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)cc2F)C(=O)N3CC[C@@H](F)C3 29253 679302 52000 nM\r\n", "N[C@@H]([C@@H]1CC[C@H](CC1)NC(=O)c2ccc(F)c(F)c2)C(=O)N3CCCC3 29482 683566 29000 nM\r\n", "N[C@@H]([C@@H]1CC[C@H](CC1)NC(=O)c2ccccc2C(F)(F)F)C(=O)N3CCSC3 29340 685042 39000 nM\r\n", "N[C@@H]([C@@H]1CC[C@H](CC1)NC(=O)OCc2ccccc2)C(=O)N3CC[C@@H](F)C3 29213 685047 43000 nM\r\n" ] } ], "source": [ "!head ../data/herg_data.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We can read in all the molecules using a \"Supplier\" object; there's more about this [in the documentation](https://www.rdkit.org/docs/GettingStartedInPython.html#reading-sets-of-molecules)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2022-12-20T05:19:08.934866Z", "start_time": "2022-12-20T05:19:08.767216Z" } }, "outputs": [ { "data": { "text/plain": [ "1090" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "suppl = Chem.SmilesMolSupplier('../data/herg_data.txt')\n", "mols = [m for m in suppl]\n", "len(mols)" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2022-12-20T04:36:08.368224Z", "start_time": 
"2022-12-20T04:36:08.365600Z" } }, "source": [ "Now calculate the descriptors. This takes a bit (10-20 seconds on my machine) for the ~1100 molecules I read in." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2022-12-20T05:19:18.220827Z", "start_time": "2022-12-20T05:19:09.896088Z" } }, "outputs": [], "source": [ "allDescrs = [getMolDescriptors(m) for m in mols]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem here is that we have a list of dictionaries... that's not useful for most things. Let's convert it to a pandas dataframe:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2022-12-20T05:19:18.393439Z", "start_time": "2022-12-20T05:19:18.221859Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>MaxEStateIndex</th><th>MinEStateIndex</th><th>MaxAbsEStateIndex</th><th>MinAbsEStateIndex</th><th>qed</th><th>MolWt</th><th>HeavyAtomMolWt</th><th>ExactMolWt</th><th>NumValenceElectrons</th><th>NumRadicalElectrons</th><th>...</th><th>fr_sulfide</th><th>fr_sulfonamd</th><th>fr_sulfone</th><th>fr_term_acetylene</th><th>fr_tetrazole</th><th>fr_thiazole</th><th>fr_thiocyan</th><th>fr_thiophene</th><th>fr_unbrch_alkane</th><th>fr_urea</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>13.787943</td><td>-4.120373</td><td>13.787943</td><td>0.074317</td><td>0.759946</td><td>419.469</td><td>395.277</td><td>419.149047</td><td>156</td><td>0</td><td>...</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>1</th><td>12.032152</td><td>-0.232407</td><td>12.032152</td><td>0.186944</td><td>0.777429</td><td>228.361</td><td>208.201</td><td>228.129634</td><td>86</td><td>0</td><td>...</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>2</th><td>13.255664</td><td>-1.036185</td><td>13.255664</td><td>0.017845</td><td>0.835147</td><td>383.464</td><td>360.280</td><td>383.147904</td><td>142</td><td>0</td><td>...</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>3</th><td>13.787093</td><td>-4.072560</td><td>13.787093</td><td>0.015196</td><td>0.786287</td><td>401.479</td><td>376.279</td><td>401.158469</td><td>150</td><td>0</td><td>...</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>4</th><td>13.326286</td><td>-4.859254</td><td>13.326286</td><td>0.063966</td><td>0.625645</td><td>467.485</td><td>442.285</td><td>467.150190</td><td>174</td><td>0</td><td>...</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 208 columns</p>\n",
"</div>
" ], "text/plain": [ " MaxEStateIndex MinEStateIndex MaxAbsEStateIndex MinAbsEStateIndex \\\n", "0 13.787943 -4.120373 13.787943 0.074317 \n", "1 12.032152 -0.232407 12.032152 0.186944 \n", "2 13.255664 -1.036185 13.255664 0.017845 \n", "3 13.787093 -4.072560 13.787093 0.015196 \n", "4 13.326286 -4.859254 13.326286 0.063966 \n", "\n", " qed MolWt HeavyAtomMolWt ExactMolWt NumValenceElectrons \\\n", "0 0.759946 419.469 395.277 419.149047 156 \n", "1 0.777429 228.361 208.201 228.129634 86 \n", "2 0.835147 383.464 360.280 383.147904 142 \n", "3 0.786287 401.479 376.279 401.158469 150 \n", "4 0.625645 467.485 442.285 467.150190 174 \n", "\n", " NumRadicalElectrons ... fr_sulfide fr_sulfonamd fr_sulfone \\\n", "0 0 ... 0 1 0 \n", "1 0 ... 1 0 0 \n", "2 0 ... 1 0 0 \n", "3 0 ... 0 1 0 \n", "4 0 ... 0 1 0 \n", "\n", " fr_term_acetylene fr_tetrazole fr_thiazole fr_thiocyan fr_thiophene \\\n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", " fr_unbrch_alkane fr_urea \n", "0 0 0 \n", "1 0 0 \n", "2 0 0 \n", "3 0 0 \n", "4 0 0 \n", "\n", "[5 rows x 208 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.DataFrame(allDescrs)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we have something that we could use to build models, filter, etc." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }