{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "#helper functions\n",
      "\n",
      "from rdkit import Chem\n",
      "from rdkit.Chem.Draw import IPythonConsole\n",
      "from rdkit.Chem import Draw\n",
      "from rdkit.Chem import AllChem\n",
      "\n",
      "def depict(input):\n",
      "    if(\">>\" in input):\n",
      "        rxn = AllChem.ReactionFromSmarts(input)       \n",
      "        return Draw.ReactionToImage(rxn)\n",
      "    else:\n",
      "        temp = Chem.MolFromSmiles(input)\n",
      "        return temp\n",
      "\n",
      "def showMMPs(in_string):\n",
      "    f = in_string.split(\",\")\n",
      "    \n",
      "    rxn =f[-2].split(\">>\")      \n",
      "        \n",
      "    mols=[]\n",
      "    ids=[]\n",
      "    \n",
      "    mols.append( Chem.MolFromSmiles(f[-6]) )\n",
      "    mols.append( Chem.MolFromSmiles(f[-5]) )\n",
      "    mols.append( Chem.MolFromSmiles(rxn[0]) )\n",
      "    mols.append( Chem.MolFromSmiles(rxn[1]) )\n",
      "    mols.append( Chem.MolFromSmiles(f[-1]) )\n",
      "    ids.append(f[-3])\n",
      "    ids.append(f[-4])\n",
      "    ids.append(\"LHS\")  \n",
      "    ids.append(\"RHS\")  \n",
      "    ids.append(\"CONTEXT\")  \n",
      "    \n",
      "    return Draw.MolsToGridImage(mols,molsPerRow=6,legends=ids)\n",
      "\n",
      "def showLine(in_string):\n",
      "    f = in_string.split(\",\")\n",
      "    \n",
      "    mols=[]\n",
      "    ids=[]\n",
      "    \n",
      "    mols.append( Chem.MolFromSmiles(f[0]) )\n",
      "    mols.append( Chem.MolFromSmiles(f[1]) )\n",
      "    mols.append( Chem.MolFromSmiles(f[4]) )\n",
      "    mols.append( Chem.MolFromSmiles(f[5]) )\n",
      "    ids.append(\"Query:%s\" % f[2])\n",
      "    ids.append(f[3])\n",
      "    ids.append(\"CHANGE\")\n",
      "    ids.append(\"CONTEXT\")   \n",
      "    \n",
      "    return Draw.MolsToGridImage(mols,molsPerRow=4,legends=ids)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Generating and searching an MMP database"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The pair index used in the MMP identification algorithm can be written to a relational database. For the indexing.py\n",
      "program already described, the index is written to memory and the program will identify all the MMPs in the dataset.\n",
      "**However, if you just want to ask a (series of) specific questions on a dataset, a relational database containing the\n",
      "pair index (MMP db) can be used to do that.**\n",
      "\n",
      "The program **create_mmp_db.py** will build a MMP db for a given dataset and the program **search_mmp_db.py** can be used to\n",
      "search the MMP db. The types of searching that can be performed on the db are as follows:\n",
      "\n",
      "1.   Find all MMPs of an input/query compound to the compounds in the db\n",
      "2.   Find all MMPs in the db where the LHS of the transform matches an input substructure\n",
      "3.   Find all MMPs that match the input transform/SMIRKS\n",
      "4.   Find all MMPs in the db where the LHS of the transform matches an input SMARTS \n",
      "5.   Find all MMPs that match the LHS and RHS SMARTS of the input transform\n",
      "\n",
      "The SMARTS searching utilises the DbCLI tools (http://code.google.com/p/rdkit/wiki/UsingTheDbCLI) that are part of the RDKit distribution.\n"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Generating the db"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To generate an MMP db use the following command:\n",
      "\n",
      "    python $RDBASE/Contrib/mmpa/create_mmp_db.py <FRAGMENT_OUTPUT\n",
      "\n",
      "The program takes a FRAGMENT_OUTPUT generated by the rfrag.py command (described above) as input.\n",
      "\n",
      "This program has several options (see help from program below):\n",
      "\n",
      "    Usage: create_mmp_db.py [options]\n",
      "\n",
      "    Program to create an MMP db.\n",
      "\n",
      "    Options:\n",
      "      -h, --help        show this help message and exit\n",
      "      -p PREFIX, --prefix=PREFIX\n",
      "                        Prefix to use for the db file (and directory for\n",
      "                        SMARTS index). DEFAULT=mmp\n",
      "      -m MAXSIZE, --maxsize=MAXSIZE\n",
      "                        Maximum size of change (in heavy atoms) that is stored\n",
      "                        in the database. DEFAULT=15.\n",
      "                        Note: Any MMPs that involve a change greater than this\n",
      "                        value will not be stored in the database and hence not\n",
      "                        be identified in the searching.\n",
      "      -s, --smarts      Build SMARTS db so can perform SMARTS searching\n",
      "                        against db. Note: Will make the build process somewhat\n",
      "                        slower.\n",
      "\n",
      "Let's build a MMP db:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cd t2_files/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ls"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python $RDBASE/Contrib/mmpa/create_mmp_db.py <sample_fragmented.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ls"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A sqllite3 db file has be created called **mmp.db**\n",
      "\n",
      "Other sample commands..\n",
      "\n",
      "Generate a db with the prefix \"my_MMP_db\" and SMARTS searching capability:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python $RDBASE/Contrib/mmpa/create_mmp_db.py -p my_MMP_db -s <sample_fragmented.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ls"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Notice the file: **my_MMP_db.db** and the directory: **my_MMP_db_smarts/** which is needed for the substructure searching"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ls my_MMP_db_smarts/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Generate a db with SMARTS searching capability and where only changes up to (and including) 10 heavy atoms are stored:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python $RDBASE/Contrib/mmpa/create_mmp_db.py -m 10 -s <sample_fragmented.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
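    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The db produced by the build is an ordinary SQLite3 file, so it can be inspected from Python. The cell below is a minimal sketch that only lists the table names it finds; `list_tables` is a hypothetical helper written for this tutorial, and no particular table layout is assumed."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import os\n",
      "import sqlite3\n",
      "\n",
      "def list_tables(db_path):\n",
      "    # Return the table names in an SQLite file, or [] if the file does not exist\n",
      "    if not os.path.exists(db_path):\n",
      "        return []\n",
      "    conn = sqlite3.connect(db_path)\n",
      "    try:\n",
      "        rows = conn.execute(\n",
      "            \"SELECT name FROM sqlite_master WHERE type='table' ORDER BY name\"\n",
      "        ).fetchall()\n",
      "    finally:\n",
      "        conn.close()\n",
      "    return [row[0] for row in rows]\n",
      "\n",
      "# mmp.db is the default file name (prefix mmp) written by create_mmp_db.py\n",
      "print(list_tables(\"mmp.db\"))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },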
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Searching the db"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To search the MMP db use the following command:\n",
      "\n",
      "    python $RDBASE/Contrib/mmpa/search_mmp_db.py [options] <INPUT_FILE\n",
      "\n",
      "This program has several options (see help from program below):\n",
      "\n",
      "    Options:\n",
      "      -h, --help            show this help message and exit\n",
      "      -t TYPE, --type=TYPE  Type of search required. Options are: mmp, subs,\n",
      "                            trans, subs_smarts, trans_smarts\n",
      "      -m MAXSIZE, --maxsize=MAXSIZE\n",
      "                            Maximum size of change (in heavy atoms) allowed in\n",
      "                            matched molecular pairs identified. DEFAULT=10.\n",
      "                            Note: This option overrides the ratio option if both\n",
      "                            are specified.\n",
      "      -r RATIO, --ratio=RATIO\n",
      "                            Only applicable with the mmp search type. Maximum\n",
      "                            ratio of change allowed in matched molecular pairs\n",
      "                            identified. The ratio is: size of change /\n",
      "                            size of cmpd (in terms of heavy atoms) for the QUERY\n",
      "                            MOLECULE. DEFAULT=0.3. Note: If this option is used\n",
      "                            with the maxsize option, the maxsize option will be\n",
      "                            used.\n",
      "      -p PREFIX, --prefix=PREFIX\n",
      "                            Prefix for the db file. DEFAULT=mmp\n",
      "\n",
      "A description of the different search options are shown below:\n",
      "\n",
      "a) **mmp**: Find all MMPs of a input/query compound to the compounds in the db\n",
      "\n",
      "b) **subs**: Find all MMPs in the db where the LHS of the transform matches an input substructure. \n",
      "\n",
      "c) **trans**: Find all MMPs that match the input transform/SMIRKS. \n",
      "\n",
      "d) **subs_smarts**: Find all MMPs in the db where the LHS of the transform matches an input SMARTS. The attachment points in the SMARTS can be denoted by [#0] (eg.[#0]c1ccccc1).\n",
      "\n",
      "e) **trans_smarts**: Find all MMPs that match the LHS and RHS SMARTS of the input transform. The transform SMARTS are input as LHS_SMARTS>>RHS_SMARTS (eg. [#0]c1ccccc1>>[#0]c1ccncc1). Note: This search can take a long time to run if a very general SMARTS expression is used."
     ]
    },
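    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The interplay of the `-m` and `-r` options can be pictured as a small filter. This is a sketch of the rule as stated in the help text (the ratio is the size of the change divided by the size of the query compound, in heavy atoms, and maxsize takes precedence when both are given). `change_allowed` is a hypothetical helper, not code from search_mmp_db.py, and treating both limits as inclusive is an assumption."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def change_allowed(change_size, query_size, maxsize=None, ratio=None):\n",
      "    # Sizes are heavy-atom counts\n",
      "    if maxsize is not None:\n",
      "        # maxsize overrides ratio when both are supplied\n",
      "        return change_size <= maxsize\n",
      "    if ratio is not None:\n",
      "        return float(change_size) / query_size <= ratio\n",
      "    # No limit requested\n",
      "    return True\n",
      "\n",
      "# Illustrative numbers: a 6 heavy-atom change in a 17 heavy-atom query\n",
      "print(change_allowed(6, 17, maxsize=10))             # absolute limit only\n",
      "print(change_allowed(6, 17, ratio=0.3))              # relative limit only\n",
      "print(change_allowed(6, 17, maxsize=10, ratio=0.3))  # both given: maxsize is used"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },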
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "a) To carry out a mmp search"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Find all MMPs of a input/query compound to the compounds in the db. You imagine using this search to identify analogues with single point changes\n",
      "\n",
      "Use search type: mmp\n",
      "\n",
      "Format of input file (space or comma separated. The ID field is optional): SMILES ID \n",
      "\n",
      "Format of output: SMILES_QUERY,SMILES_OF_MMP,QUERY_ID,RETRIEVED_ID,CHANGED_SMILES,CONTEXT_SMILES\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!head sample_db_input_smi.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "depict(\"c1cc2c(ncnc2NCc2cccnc2)s1\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t mmp <sample_db_input_smi.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "showLine(\"c1cc2c(ncnc2NCc2cccnc2)s1,c1cc2c(ncnc2NCc2ccccc2)s1,2531831,2139597,[*:1]c1ccccc1,[*:1]CNc1ncnc2sccc21\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
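    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For further processing, each output line of an mmp search can be split into named fields. A minimal sketch, assuming the comma-separated output format given above and that no field contains a comma; `MMPHit` and `parse_mmp_line` are hypothetical names introduced here."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from collections import namedtuple\n",
      "\n",
      "# Field order follows the documented output format of the mmp search\n",
      "MMPHit = namedtuple(\"MMPHit\", [\"query_smiles\", \"mmp_smiles\", \"query_id\",\n",
      "                               \"retrieved_id\", \"changed_smiles\", \"context_smiles\"])\n",
      "\n",
      "def parse_mmp_line(line):\n",
      "    # One line of output from: search_mmp_db.py -t mmp\n",
      "    fields = line.strip().split(\",\")\n",
      "    if len(fields) != 6:\n",
      "        raise ValueError(\"expected 6 comma-separated fields, got %d\" % len(fields))\n",
      "    return MMPHit(*fields)\n",
      "\n",
      "hit = parse_mmp_line(\"c1cc2c(ncnc2NCc2cccnc2)s1,c1cc2c(ncnc2NCc2ccccc2)s1,2531831,2139597,[*:1]c1ccccc1,[*:1]CNc1ncnc2sccc21\")\n",
      "print(\"%s %s\" % (hit.retrieved_id, hit.changed_smiles))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },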
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "b) To carry out a LHS transform substructure search:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Find all MMPs in the db where the LHS of the transform matches an input substructure. Make sure the attached points are denated by an asterisk and the input substructure has been **canonicalised** (eg. [*]c1ccccc1). Note: Up to 3 attachement points are allowed.\n",
      "\n",
      "Use search type: subs\n",
      "\n",
      "Format of input file (space or comma separated. The ID field is optional): Substructure_SMILES ID\n",
      "\n",
      "Format of output: Input_substructure[,input_id],SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!head sample_db_input_subs.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "depict(\"[*]c1ccccc1\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t subs <sample_db_input_subs.txt "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "showMMPs(\"[*]c1ccccc1,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]c1ccccc1>>[*:1]c1cccnc1,[*:1]CNc1ncnc2sccc21\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "c) To carry out a transform search:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Find all MMPs that match the input transform/SMIRKS. Make sure the input SMIRKS has been **canonicalised using the cansmirk.py program**.\n",
      "\n",
      "Use search type: trans\n",
      "\n",
      "Format of input file (space or comma separated. The ID field is optional): SMIRKS ID\n",
      "\n",
      "Format of output: [input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!head sample_db_input_trans.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "depict(\"[*:1]c1ccccc1>>[*:1]c1cccnc1\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t trans <sample_db_input_trans.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "showMMPs(\"t1,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]c1ccccc1>>[*:1]c1cccnc1,[*:1]CNc1ncnc2sccc21\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "d) To carry out a LHS transform substructure SMARTS search:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Find all MMPs in the db where the LHS of the transform matches an input SMARTS. **The attachment points in the SMARTS can be denoted by [#0]** (eg. [#0]c1ccccc1).\n",
      "\n",
      "Use search type: subs_smarts\n",
      "\n",
      "Format of input file (space or comma separated. The ID field is optional): SMARTS ID\n",
      "\n",
      "Format of output: [input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!head sample_db_input_subs_smarts.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t subs_smarts <sample_db_input_subs_smarts.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "showMMPs(\"a,NC(=O)c1ccc(NC(=O)C2COc3ccccc3O2)cc1,O=C(O)c1ccc(NC(=O)C2COc3ccccc3O2)cc1,2787356,2881039,[*:1]c1ccc(C(N)=O)cc1>>[*:1]c1ccc(C(=O)O)cc1,[*:1]NC(=O)C1COc2ccccc2O1\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "showMMPs(\"a,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]c1ccccc1>>[*:1]c1cccnc1,[*:1]CNc1ncnc2sccc21\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "e) To carry out a transform SMARTS search:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Find all MMPs that match the LHS and RHS SMARTS of the input transform. The transform SMARTS are input as **LHS_SMARTS>>RHS_SMARTS** (eg. [#0]c1ccccc1>>[#0]c1ccncc1). Note: This search can take a long time to run if a very general SMARTS expression is used.\n",
      "\n",
      "Use search type: trans_smarts\n",
      "\n",
      "Format of input file (space or comma separated. The ID field is optional): SMARTS ID \n",
      "\n",
      "Format of output: input_transform_SMARTS,[input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!head sample_db_input_trans_smarts.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t trans_smarts <sample_db_input_trans_smarts.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "showMMPs(\"c>>n,ts,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]NCc1ccccc1>>[*:1]NCc1cccnc1,[*:1]c1ncnc2sccc21\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
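    ,
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The pair outputs of the subs, trans, subs_smarts and trans_smarts searches all end with the same six fields, which is why `showMMPs` indexes from the end of the line. The cell below is a minimal sketch of a parser built on that observation. `parse_pair_line` is a hypothetical helper; whatever precedes the six fields (the echoed input and/or its ID) is returned unparsed, since an input SMARTS may itself contain commas."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def parse_pair_line(line):\n",
      "    fields = line.strip().split(\",\")\n",
      "    if len(fields) < 6:\n",
      "        raise ValueError(\"expected at least 6 comma-separated fields\")\n",
      "    # Last six fields: SMILES_MMP1, SMILES_MMP2, MMP1_ID, MMP2_ID, Transform, Context\n",
      "    smiles1, smiles2, id1, id2, transform, context = fields[-6:]\n",
      "    lhs, rhs = transform.split(\">>\")\n",
      "    return {\n",
      "        \"input\": \",\".join(fields[:-6]),  # echoed input and/or ID, left unparsed\n",
      "        \"mmp1\": (id1, smiles1),\n",
      "        \"mmp2\": (id2, smiles2),\n",
      "        \"lhs\": lhs,\n",
      "        \"rhs\": rhs,\n",
      "        \"context\": context,\n",
      "    }\n",
      "\n",
      "pair = parse_pair_line(\"c>>n,ts,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]NCc1ccccc1>>[*:1]NCc1cccnc1,[*:1]c1ncnc2sccc21\")\n",
      "print(\"%s -> %s\" % (pair[\"lhs\"], pair[\"rhs\"]))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }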
   ],
   "metadata": {}
  }
 ]
}