{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "# Functions to help visualise output\n", "\n", "from rdkit import Chem\n", "from rdkit.Chem.Draw import IPythonConsole\n", "from rdkit.Chem import Draw\n", "from rdkit.Chem import AllChem\n", "\n", "def depict(text):\n", "    # Draw a reaction if the string is a SMIRKS (contains '>>'), otherwise a molecule\n", "    if \">>\" in text:\n", "        rxn = AllChem.ReactionFromSmarts(text)\n", "        return Draw.ReactionToImage(rxn)\n", "    return Chem.MolFromSmiles(text)\n", "\n", "def showFile(in_file):\n", "    # Show the first 10 molecules of a 'SMILES ID' file, labelled with their IDs\n", "    mols = []\n", "    ids = []\n", "    with open(in_file, \"r\") as fo:\n", "        for n, line in enumerate(fo):\n", "            if n >= 10:\n", "                break\n", "            smi, mol_id = line.rstrip().split()\n", "            mols.append(Chem.MolFromSmiles(smi))\n", "            ids.append(mol_id)\n", "    return Draw.MolsToGridImage(mols, molsPerRow=5, legends=ids)\n", "\n", "def showMMPs(in_file):\n", "    # Show the first 9 pairs of an MMP CSV file, labelled with their SMIRKS\n", "    mols = []\n", "    trans = []\n", "    with open(in_file, \"r\") as fo:\n", "        for n, line in enumerate(fo):\n", "            if n >= 9:\n", "                break\n", "            m1, m2, id1, id2, t, c = line.rstrip().split(\",\")\n", "            # Join both molecules with '.' so each pair is drawn in one image\n", "            mols.append(Chem.MolFromSmiles(\"%s.%s\" % (m1, m2)))\n", "            trans.append(t)\n", "    return Draw.MolsToGridImage(mols, molsPerRow=3, subImgSize=(300, 300), legends=trans)\n", "\n", "def showLine(in_string):\n", "    # Fields, counted from the end: SMILES1,SMILES2,ID1,ID2,SMIRKS,CONTEXT\n", "    f = in_string.split(\",\")\n", "    rxn = f[-2].split(\">>\")\n", "\n", "    smiles = [f[-6], f[-5], rxn[0], rxn[1], f[-1]]\n", "    mols = [Chem.MolFromSmiles(s) for s in smiles]\n", "    # Legends follow the molecule order: left ID first, then right ID\n", "    ids = [f[-4], f[-3], \"LHS\", \"RHS\", \"CONTEXT\"]\n", "\n", "    return Draw.MolsToGridImage(mols, molsPerRow=6, legends=ids)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, 
"metadata": {}, "source": [ "To find all the MMPs in your set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The program to generate the MMPs from a set is divided into two parts; fragmentation and indexing. \n", "\n", "Before running the programs, make sure your input set of SMILES:\n", "\n", "* Does not contain mixtures (salts etc.) \n", "* Does not contain * atoms\n", "* Has been canonicalised using RDKit.\n", "\n", "If your smiles set doesn't satisfy the conditions above the programs are likely to fail or in the case of \n", "canonicalisation result in not identifying MMPs involving H atom substitution.\n", "\n", "The file sample.smi has been canonicalised using RDKit.\n" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "1) Fragmentation command:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "cd t1_files/" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ls" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Note:*** You need a ! to run a Linux command in this ipython notebook. It's not needed on the command line. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!wc -l sample.smi" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!head sample.smi" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "showFile('sample.smi')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example fragmentation command:\n", "\n", " python $RDBASE/Contrib/mmpa/rfrag.py FRAGMENT_OUTPUT\n", "\n", "Program help (use the command: rfrag.py -h)\n", "\n", " Program that fragments a user input set of smiles.\n", " The program enumerates every single,double and triple acyclic single bond cuts in a molecule.\n", " \n", " USAGE: ./rfrag.py out_fragmented.txt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets have a look at the output:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head out_fragmented.txt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "2) Index command:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example indexing command:\n", "\n", " python $RDBASE/Contrib/mmpa/indexing.py MMP_OUTPUT.CSV\n", "\n", "Format of output:\n", "SMILES_OF_LEFT_MMP,SMILES_OF_RIGHT_MMP,ID_OF_LEFT_MMP,ID_OF_RIGHT_MMP,SMIRKS_OF_TRANSFORMATION,SMILES_OF_CONTEXT \n", "\n", "Program help (use the command: indexing.py -h)\n", "\n", " Usage: indexing.py [options]\n", "\n", " Program to generate MMPs\n", " \n", " Options:\n", " -h, --help show this help message and exit\n", " -s, --symmetric Output symmetrically equivalent MMPs, i.e output both\n", " cmpd1,cmpd2, SMIRKS:A>>B and cmpd2,cmpd1, SMIRKS:B>>A\n", " -m MAXSIZE, --maxsize=MAXSIZE\n", " Maximum size of change (in heavy atoms) allowed in\n", " matched molecular pairs identified. 
DEFAULT=10.\n", "                            Note: This option overrides the ratio option if both\n", "                            are specified.\n", "      -r RATIO, --ratio=RATIO\n", "                            Maximum ratio of change allowed in matched molecular\n", "                            pairs identified. The ratio is: size of change /\n", "                            size of cmpd (in terms of heavy atoms). DEFAULT=0.3.\n", "                            Note: If this option is used with the maxsize option,\n", "                            the maxsize option will be used.\n", "\n", "Let's see some examples:\n", "\n", "Default settings:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!python $RDBASE/Contrib/mmpa/indexing.py <out_fragmented.txt >out_mmps_default.csv" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!head out_mmps_default.csv" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "showLine(\"Cc1cccn2cc(-c3cccc(S(=O)(=O)N4CCCCC4)c3)nc12,Cc1cccn2cc(-c3ccc(S(=O)(=O)N4CCCCC4)cc3)nc12,2963575,1156028,[*:1]c1cccc([*:2])c1>>[*:1]c1ccc([*:2])cc1,[*:1]c1cn2cccc(C)c2n1.[*:2]S(=O)(=O)N1CCCCC1\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "showMMPs(\"out_mmps_default.csv\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Output MMPs where the maximum size of change is 3 heavy atoms:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!python $RDBASE/Contrib/mmpa/indexing.py -m 3 <out_fragmented.txt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Output MMPs where no more than 10% of the compound has changed (and using the symmetric option):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!python $RDBASE/Contrib/mmpa/indexing.py -r 0.1 <out_fragmented.txt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ 
"!python $RDBASE/Contrib/mmpa/indexing.py -r 0.1 -s out_mmps_sym.csv" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!head out_mmps_sym.csv\n", "showMMPs(\"out_mmps_sym.csv\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "SMIRKS canonicalisation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a look at the following SMIRKS:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "depict(\"[*:2]c1ccc([*:1])o1>>[*:1]c1ccc([*:2])cc1\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "depict(\"[*:1]c1ccc([*:2])o1>>[*:2]c1ccc([*:1])cc1\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the SMIRKS strings are different but the change they describe are the **same** (as the positions on the furan and benzene rings are **symmetrically equivalent**).\n", "\n", "The algorithm used to canonicalise SMIRKS is as follows:\n", "\n", "1. Canonicalise the LHS.\n", "2. For the LHS the 1st asterisk (attachment point) in the SMILES will have label 1, 2nd asterisk will have label 2 and so on\n", "3. For the RHS, if you have a choice (ie. two attachment points are symmetrically equivalent), always put the label with lower numerical value on the earlier attachment point in the canonicalised SMILES\n", "\n", "\n", "The MMP identification script uses a SMIRKS canonicalisation routine so the same change always has the same output SMIRKS. 
To canonicalise a SMIRKS (generated elsewhere) so that it is in the same format as the MMP identification scripts produce, use the command:\n", "\n", "    cansmirk.py <SMIRKS_FILE >SMIRKS_OUTPUT_FILE\n", "\n", "Format of SMIRKS_FILE (space or comma separated): SMIRKS ID \n", "\n", "Format of output: CANONICALISED_SMIRKS ID\n", "\n", "Note: The script will **NOT** deal with SMARTS characters, so the SMIRKS must contain **valid SMILES** for left and right hand sides.\n", "\n", "Let's try an example:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head sample_smirks.txt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!python $RDBASE/Contrib/mmpa/cansmirk.py <sample_smirks.txt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to use a set of SMIRKS generated elsewhere, please make sure they have been canonicalised using the cansmirk.py command.\n", "\n", "Program help:\n", "\n", "    Usage: mol_transform.py [options]\n", "\n", "    Program to apply transformations to a set of input molecules\n", "\n", "    Options:\n", "      -h, --help            show this help message and exit\n", "      -f TRANSFORM_FILE, --file=TRANSFORM_FILE\n", "                            The file containing the transforms to apply to your\n", "                            input SMILES\n", "\n", "    Example command: mol_transform.py -f TRANSFORM_FILE \n", "    Format of transform file: transform \n", "    Output: SMILES,ID,Transfrom,Modified_SMILES\n", "\n", "\n", "Let's go through an example:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head sample_smirks_mol_trans.txt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "depict(\"[*:1]C(=O)O>>[*:1]C(N)=O\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!head sample_smiles_mol_trans.smi" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "depict(\"O=C(O)c1ccc(NC(=O)C2COc3ccccc3O2)cc1\")" ], "language": "python", "metadata": {}, 
"outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's run the command:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!python $RDBASE/Contrib/mmpa/mol_transform.py -f sample_smirks_mol_trans.txt