{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using Django ORM to access the ChEMBL DB without SQL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### myChEMBL team, ChEMBL group, EMBL-EBI." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ORM stands from [Object Relational Mapping](https://en.wikipedia.org/wiki/Object-relational_mapping). It is a technology that allows access to a database in a native object oriented manner. Using ORM has several advantages:\n", " 1. You don't have to put SQL strings into your code\n", " 2. Your code is no longer tied to a specific database backend (such us postgres)\n", " 3. Code written using ORM is more intuitive\n", "\n", "There is already one ORM interface for ChEMBL, called [pychembldb](https://github.com/kzfm/pychembldb). It's based on the popular [SQLAlchemy](http://www.sqlalchemy.org/) library. To complete this picture, we provide ORM models based on [Django](https://www.djangoproject.com/). This notebook presents how complex DB queries can be built using Django ORM within a IPython Notebook.\n", "\n", "Django models described here are used in ChEMBL production code, for example in ChEMBL web services. As a result, the models can be loaded into myChEMBL, which uses postgres and they still work in ChEMBL production environment, where Oracle is used. The same codebase works against two different DB engines without modifying a single line of code.\n", "\n", "Please note that this notebook covers only most basic usage of Django ORM in the context of ChEMBL data.\n", "For more comprehensive description and advanced usage, read [Making Queries Django Tutorial](https://docs.djangoproject.com/en/dev/topics/db/queries/) and [QuerySet API reference](https://docs.djangoproject.com/en/dev/ref/models/querysets/)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DEBUG = False\r\n", "\r\n", "DATABASES = {\r\n", " 'default': {\r\n", " 'ENGINE': 'django.db.backends.postgresql_psycopg2', # Add 'postgresql_psycopg2', 'mysql', 'sqlite3' or 'oracle'.\r\n", " 'NAME': 'chembl_21', # Or path to database file if using sqlite3.\r\n", " 'USER': 'chembl', # Not used with sqlite3.\r\n", " 'PASSWORD': '', # Not used witis oracle\r\n", " 'HOST': '', # Set to empty string for localhost. Not used with sqlite3.\r\n", " 'PORT': '5432', # Set to empty string for default. Not used with sqlite3.\r\n", " },\r\n", "}\r\n", "\r\n", "EXPORT_MODE = True\r\n", "CORE_TABLES_MANAGED = False\r\n", "APP_SPECIFIC_TABLES_MANAGED = False\r\n", "COMPOUND_MOLS_TABLE = 'mols_rdkit'\r\n", "CTAB_COLUMN = 'm'\r\n", "\r\n", "SECRET_KEY = '3v2xb&@&_kibf0o!4m249njy3!qjxptht0m%q2w&ry8v&ok$na'\r\n", "\r\n", "INSTALLED_APPS = (\r\n", " 'chembl_core_db',\r\n", " 'chembl_core_model',\r\n", " )\r\n" ] } ], "source": [ "# In order to use django ORM we have to provide a settings module. This module must be kept in a separate file.\n", "# A minimal settings.py contains database connection details, logging and caching configuration as well as a list of \n", "# applications (python packages) defining models we are planning to use. Let's see how our settings.py file for this\n", "# notebook will look like:\n", "!cat /home/chembl/ipynb_workbench/settings.py" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'settings'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# In order to configure django to use our settings we have to append the module containing it to system path and set \n", "# DJANGO_SETTINGS_MODULE environment variable.\n", "import os\n", "import sys\n", "sys.path.append('/home/chembl/ipynb_workbench')\n", "os.environ.setdefault(\"DJANGO_SETTINGS_MODULE\", \"settings\")\n", "# That's it! We are ready to use our models." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ActionType\n", "Activities\n", "ActivityStdsLookup\n", "AssayParameters\n", "AssayType\n", "Assays\n", "AtcClassification\n", "BindingSites\n", "BioComponentSequences\n", "BioassayOntology\n", "BiotherapeuticComponents\n", "Biotherapeutics\n", "CellDictionary\n", "ChemblIdLookup\n", "ComponentClass\n", "ComponentDomains\n", "ComponentGo\n", "ComponentSequences\n", "ComponentSynonyms\n", "CompoundProperties\n", "CompoundRecords\n", "CompoundStructuralAlerts\n", "CompoundStructures\n", "ConfidenceScoreLookup\n", "CurationLookup\n", "DataValidityLookup\n", "DefinedDailyDose\n", "Docs\n", "Domains\n", "DrugIndication\n", "DrugMechanism\n", "Formulations\n", "FracClassification\n", "GoClassification\n", "HracClassification\n", "IndicationRefs\n", "IracClassification\n", "LigandEff\n", "MechanismRefs\n", "Metabolism\n", "MetabolismRefs\n", "MoleculeAtcClassification\n", "MoleculeDictionary\n", "MoleculeFracClassification\n", "MoleculeHierarchy\n", "MoleculeHracClassification\n", "MoleculeIracClassification\n", "MoleculeSynonyms\n", "OrganismClass\n", "ParameterType\n", "PatentUseCodes\n", "PredictedBindingDomains\n", "ProductPatents\n", "Products\n", "ProteinClassSynonyms\n", "ProteinClassification\n", "ProteinFamilyClassification\n", "RelationshipType\n", "ResearchCompanies\n", "ResearchStem\n", "SiteComponents\n", "Source\n", "StructuralAlertSets\n", "StructuralAlerts\n", "TargetComponents\n", "TargetDictionary\n", "TargetRelations\n", "TargetType\n", "UsanStems\n", "Version\n" ] } ], "source": [ "# First, lets import our ChEMBL ORM models. The module is called 'chembl_migration_model' because it describes ChEMBL DB\n", "# schema, that is used in my_ChEMBL VM and in public SQL dumps. Internally, ChEMBL group is working on slightly more complex\n", "# schema, containing more tables and columns.\n", "import chembl_migration_model\n", "\n", "# OK, now we can take a look at all available models:\n", "print '\\n'.join([model for model in dir(chembl_migration_model.models) if not model.startswith('__')])\n", "\n", "# As you see each model corresponds to single table in ChEMBL DB. The name of the model is the name of the corresponding table,\n", "# but without underscores and written in camel case, i.e. ASSAY_PARAMETERS (DB table) => AssayParameters (ORM model)\n", "\n", "# For convenience we will import model names into default namespece:\n", "from chembl_migration_model.models import *\n", "\n", "# We need CompoundMols model to perform substructure and similarity search:\n", "from chembl_core_model.models import CompoundMols" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting objects count" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1592191" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# For starters, lets see how many unique compounds we currently have at ChEMBL:\n", "MoleculeDictionary.objects.count()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "11019" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How about tagets?\n", "TargetDictionary.objects.count()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1212831" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Assays?\n", "Assays.objects.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering objects" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "33\n" ] } ], "source": [ "# We will learn filtering on CompoundPoperties model. First let's get all CompoundProperties objects with molecular\n", "# formula = C9H8O4\n", "compounds = CompoundProperties.objects.filter(full_molformula=\"C9H8O4\")\n", "# how many of them exist?\n", "print compounds.count()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.23\n", "1\n", "180.16\n" ] } ], "source": [ "# OK, let's take the first object:\n", "compound = compounds[0]\n", "# We can access all properties of this compound, for example alogp, numner of aromatic rings, molecular weight, etc:\n", "print compound.alogp\n", "print compound.aromatic_rings\n", "print compound.full_mwt" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(4, 1, Decimal('63.60')), (4, 3, Decimal('77.76')), (4, 0, Decimal('60.44')), (4, 1, Decimal('63.60')), (4, 2, Decimal('74.59')), (4, 1, Decimal('63.60')), (4, 2, Decimal('74.60')), (4, 2, Decimal('74.60')), (4, 1, Decimal('63.60')), (4, 1, Decimal('63.60')), (4, 1, Decimal('63.60')), (4, 1, Decimal('55.76')), (4, 1, Decimal('55.76')), (4, 1, Decimal('55.76')), (4, 1, Decimal('55.76')), (4, 1, Decimal('55.76')), (4, 1, Decimal('55.76')), (4, 3, Decimal('77.75')), (4, 2, Decimal('66.75')), (4, 2, Decimal('74.59')), '...(remaining elements truncated)...']\n" ] } ], "source": [ "# What if we would like for a result to be a list of certain fields not a collection of whole objects?\n", "results = CompoundProperties.objects.filter(full_molformula=\"C9H8O4\").values_list('hba', 'hbd', 'psa')\n", "print results" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "[Decimal('180.16'), Decimal('180.16')]\n" ] } ], "source": [ "# Perfect, can I include many conditions in filters?\n", "\n", "# Yes! Following query will get only structures with formula = C9H8O4, hbd > 1 and psa between (60 and 70):\n", "compounds = CompoundProperties.objects.filter(full_molformula=\"C9H8O4\", hbd__gt=1, psa__range=(60.0, 70.0))\n", "print compounds.count()\n", "print compounds.values_list('full_mwt', flat=True)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "142903\n", "9082\n" ] } ], "source": [ "# Additionally to 'filter' there is 'exclude' method as well:\n", "\n", "# Let's select all structures, without aromatic rings:\n", "non_aromatic = CompoundProperties.objects.exclude(aromatic_rings__gt=0)\n", "print non_aromatic.count()\n", "\n", "# Having non-aromatic structures, can we apply more fileters only to them?\n", "\n", "# Yes, we can chain 'filter' and 'exclude' methods:\n", "\n", "light_non_aromatic = non_aromatic.filter(full_mwt__lte=180.0)\n", "print light_non_aromatic.count()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "614855\n" ] } ], "source": [ "# Great, but so har we've only seen a conjunction of filters, what if I would like to have alternative?\n", "\n", "# How would I select all structures with 2 aromatic rings OR with formula = C9H8O4?\n", "# This is where we need django helper operator called 'Q' (stands from Query):\n", "from django.db.models import Q\n", "\n", "# Now using Q, we can ask for alternative:\n", "aromatic_or_formula = CompoundProperties.objects.filter(Q(aromatic_rings__gt=2) | Q(full_molformula=\"C9H8O4\"))\n", "print aromatic_or_formula.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distinct values and ordering" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[None, 8, 12, 17, 19, 2, 1, 15, 3, 10, 11, 4, 5, 13, 9, 0, 7, 6]\n" ] } ], "source": [ "# I would like to see all distinct avlues of the numer of aromatics rings across all compound structures:\n", "\n", "print CompoundProperties.objects.values_list('aromatic_rings', flat=True).distinct()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19, None]\n" ] } ], "source": [ "# Hmm, the result is not ordered...\n", "\n", "print CompoundProperties.objects.values_list('aromatic_rings', flat=True).distinct().order_by('aromatic_rings')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[None, 19, 17, 15, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]\n" ] } ], "source": [ "# OK, but I need descending order...\n", "\n", "print CompoundProperties.objects.values_list('aromatic_rings', flat=True).distinct().order_by('-aromatic_rings')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Limiting results" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10\n", "13020.30\n", "C612H1104N39O198P21S6\n", "(u'C612H1104N39O198P21S6', Decimal('13020.30'))\n", "(u'C396H390F252N66O24P42', Decimal('12546.26'))\n", "(u'C540H960N39O198P21S6', Decimal('12010.38'))\n", "(u'C504H888N39O198P21S6', Decimal('11505.43'))\n", "(u'C324H318F252N42P42', Decimal('10888.77'))\n", "(u'C468H828N27O198P21S6', Decimal('10844.48'))\n", "(u'C480H828N27O186P21S6', Decimal('10796.62'))\n", "(u'C325H387N118O209P29', Decimal('10188.45'))\n", "(u'C319H383N118O209P29', Decimal('10112.36'))\n", "(u'C318H382N119O208P29', Decimal('10097.35'))\n" ] } ], "source": [ "# I want just a ranking of 10 most heavy structures from ChEMBL:\n", "\n", "heaviest = CompoundProperties.objects.order_by('-full_mwt')[0:10]\n", "print heaviest.count()\n", "\n", "# And the winner is:\n", "winner = heaviest[0]\n", "print winner.full_mwt\n", "print winner.full_molformula\n", "\n", "# Lets get a list of molformulas and wiights:\n", "print '\\n'.join(map(str, heaviest.values_list('full_molformula', 'full_mwt')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking for existence" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's check if there is a compound with 2 aromatic rings and weight lesst than 200:\n", "\n", "CompoundProperties.objects.filter(aromatic_rings=2, full_mwt__lt=200.0).exists()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting single object" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ETHOXZOLAMIDE\n" ] } ], "source": [ "# Sometimes we want to be sure that only a single object has been selected, \n", "# this time let's take MoleculeDictionary as the example:\n", "\n", "chembl_18 = MoleculeDictionary.objects.get(chembl_id='CHEMBL18')\n", "print chembl_18.pref_name" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "does not exist\n", "too many\n" ] } ], "source": [ "# 'get' method will throw an exception in case when no results or more than one results were found:\n", "\n", "try:\n", " imaginary = MoleculeDictionary.objects.get(chembl_id='CHEMBLX')\n", "except MoleculeDictionary.DoesNotExist:\n", " print 'does not exist'\n", " \n", "try:\n", " too_many = MoleculeDictionary.objects.get(prodrug=True)\n", "except MoleculeDictionary.MultipleObjectsReturned:\n", " print 'too many'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chemistry stuff - substructure and similarity search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The important feature of every chemistry database it the ability to perform substructure and similarity search. By default, Django is copletely chemistry unaware but 'chembl_core_model' brings this functionality." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n", "(1280, 1.0)\n", "(271022, 0.857142857142857)\n", "(289908, 0.7)\n", "(875057, 0.7)\n" ] } ], "source": [ "# We can perform similarity search using CompoundMols object:\n", "similar = CompoundMols.objects.similar_to('CC(=O)Oc1ccccc1C(=O)O', 70)\n", "print similar.count()\n", "\n", "# Due to some strane bug in Django we always have to mantion additional 'similarity' field but this is even better\n", "# because this field is so useful :)\n", "print '\\n'.join(map(str, similar.values_list('molecule_id', 'similarity')))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "73\n" ] } ], "source": [ "# substructure search works the same way:\n", "sub = CompoundMols.objects.with_substructure('CN(CCCN)c1cccc2ccccc12')\n", "print sub.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Related objects - joins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The whole idea of having relational database is the ability to join related objecs together.\n", "Let's see how this can be achieved using ORM." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1280, 1.0, u'ASPIRIN')\n", "(271022, 0.857142857142857, u'DIPLOSALSALATE')\n", "(289908, 0.7, None)\n", "(875057, 0.7, u'DIPYROCETYL')\n" ] } ], "source": [ "# First of all let's join our substructure and similarity search results with other objects to get some useful information.\n", "# For example, we will get CHEMBL_ID and preffered name of similarity results.\n", "# Here we join CompouldMols objects (`similar` variable) with `pref_name` field of MoleculeDictionary\n", "# Double underscore before field name means we are accessing related object:\n", "print '\\n'.join(map(str, similar.values_list('molecule_id', 'similarity', 'molecule__pref_name')))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(u'C21H19N5O3', Decimal('389.41'))\n", "(u'C23H17N3O5', Decimal('415.40'))\n", "(u'C24H22N2O2', Decimal('370.44'))\n", "(u'C34H37N5O5', Decimal('595.69'))\n", "(u'C34H37N5O5', Decimal('595.69'))\n", "(u'C19H19N3O5', Decimal('369.37'))\n", "(u'C29H24N4O3', Decimal('476.53'))\n", "(u'C25H23N5O3', Decimal('441.48'))\n", "(u'C26H24N4O3', Decimal('440.49'))\n", "(u'C21H22N4O3', Decimal('378.42'))\n" ] } ], "source": [ "# And molecular formula and molecular weight of first 10 of substructure results:\n", "print '\\n'.join(map(str,sub.values_list('molecule__compoundproperties__full_molformula', 'molecule__compoundproperties__full_mwt')[0:10]))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8-Hour Bayer, Acetylsalicylic Acid, Aspirin, Bayer Extra Strength, Measurin, Acetosalic Acid, Acetylsalicylic Acid, Ecotrin, Equi-Prin, Salicylic Acid Acetate, Aspirin, Aspirin, Aspirin, Aspirin, Aspirin, Acetylsalicylic Acid\n" ] } ], "source": [ "# Get the synonyms of aspirin:\n", "aspirin = MoleculeDictionary.objects.get(chembl_id='CHEMBL25')\n", "\n", "# When having a single object we can simply access realted objects using dot (`.`) operator:\n", "print ', '.join(aspirin.moleculesynonyms_set.values_list('synonyms', flat=True))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VELPATASVIR, Tideglusib, BURMANIN B, NAPIRIMUS, NAFOXADOL, CERLAPIRDINE, NARANOL, TEVERELIX, CHROMOMYCIN SA3, DEACETYL CHROMOMYCIN A3\n" ] } ], "source": [ "# Get 10 preferred names of compounds containing 'CN(CCCN)c1cccc2ccccc12' substructure:\n", "print ', '.join(CompoundMols.objects.filter(molecule__pref_name__isnull=False).with_substructure('c1cccc2ccccc12')[0:10].values_list('molecule__pref_name', flat=True))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1377248\n", "CHEMBL113081, CHEMBL324340, CHEMBL324340, CHEMBL109600, CHEMBL357278, CHEMBL357119, CHEMBL152968, CHEMBL152968, CHEMBL152968, CHEMBL152968\n" ] } ], "source": [ "#Retrieve all the compounds which have an IC50 bioactivity value in nM:\n", "result = MoleculeDictionary.objects.filter(activities__standard_type = 'IC50', activities__standard_units = 'nM')\n", "print result.count()\n", "print ', '.join(result[0:10].values_list('chembl_id', flat=True))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 0 }