{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "I noticed a mention of [Intake](https://github.com/intake/intake) in an Anaconda [blog post](https://www.anaconda.com/intake-taking-the-pain-out-of-data-access/) last year and filed it away as: \"oooo, that's interesting, I should look into that at some point.\" I really like the idea of having a straightforward Python-based system for creating and working with a data catalog; I've got way too many random datasets floating around that I use for various things, and I think Intake could help bring some order to that.\n", "\n", "Today I decided to celebrate Friday by doing a bit of tech exploration and picked Intake. After going quickly through [the documentation](http://intake.readthedocs.io/en/latest) and making sure the library is still alive (i.e., active on GitHub), I dove in.\n", "\n", "Of course, what I'm most interested in doing is figuring out how to add chemistry file formats to Intake. So after some initial experimentation to make sure I can use the library at all and that it's actually as straightforward as it seems, I did the obvious first thing and added an SDF loader.\n", "\n", "This post is mostly about that. Well, that and a short demo of using Intake.\n", "\n", "I'm going to create a data catalog that contains three different data sources, all connected to the same dataset: a collection of compounds with hERG data that came from ChEMBL. I'm not 100% sure where this dataset came from (see, I need something to help with this!), but I *think* it's from this [blog post](http://rdkit.blogspot.com/2014/08/mmpa-on-chembl-herg-data-using-pandas.html). I wanted multiple linked source files, so in addition to the original file, I created a CSV file with calculated 2D descriptors and an SDF with 3D structures of the molecules and some 3D descriptors.\n", "\n", "Let's get started.\n", "\n", "In order to use the code in this notebook, you will need to install intake and hvplot. 
I'm using conda (of course), so I got intake from conda-forge and used the hvplot version from the default channel:\n", "```\n", "conda install -c conda-forge intake\n", "conda install hvplot\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "\n", "(function(root) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " var force = true;\n", "\n", " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", " root._bokeh_onload_callbacks = [];\n", " root._bokeh_is_loading = undefined;\n", " }\n", "\n", " var JS_MIME_TYPE = 'application/javascript';\n", " var HTML_MIME_TYPE = 'text/html';\n", " var EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", " var CLASS_NAME = 'output_bokeh rendered_html';\n", "\n", " /**\n", " * Render data to the DOM node\n", " */\n", " function render(props, node) {\n", " var script = document.createElement(\"script\");\n", " node.appendChild(script);\n", " }\n", "\n", " /**\n", " * Handle when an output is cleared or removed\n", " */\n", " function handleClearOutput(event, handle) {\n", " var cell = handle.cell;\n", "\n", " var id = cell.output_area._bokeh_element_id;\n", " var server_id = cell.output_area._bokeh_server_id;\n", " // Clean up Bokeh references\n", " if (id != null && id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", "\n", " if (server_id !== undefined) {\n", " // Clean up Bokeh references\n", " var cmd = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", " cell.notebook.kernel.execute(cmd, {\n", " iopub: {\n", " output: function(msg) {\n", " var id = msg.content.text.trim();\n", " if (id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", " }\n", " }\n", " });\n", " // Destroy server and session\n", 
" var cmd = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", " cell.notebook.kernel.execute(cmd);\n", " }\n", " }\n", "\n", " /**\n", " * Handle when a new output is added\n", " */\n", " function handleAddOutput(event, handle) {\n", " var output_area = handle.output_area;\n", " var output = handle.output;\n", "\n", " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", " if ((output.output_type != \"display_data\") || (!output.data.hasOwnProperty(EXEC_MIME_TYPE))) {\n", " return\n", " }\n", "\n", " var toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", "\n", " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", " // store reference to embed id on output_area\n", " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", " }\n", " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", " var bk_div = document.createElement(\"div\");\n", " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", " var script_attrs = bk_div.children[0].attributes;\n", " for (var i = 0; i < script_attrs.length; i++) {\n", " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", " }\n", " // store reference to server id on output_area\n", " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", " }\n", " }\n", "\n", " function register_renderer(events, OutputArea) {\n", "\n", " function append_mime(data, metadata, element) {\n", " // create a DOM node to render to\n", " var toinsert = this.create_output_subarea(\n", " metadata,\n", " CLASS_NAME,\n", " EXEC_MIME_TYPE\n", " );\n", " this.keyboard_manager.register_events(toinsert);\n", " // Render to node\n", " var props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", " render(props, toinsert[toinsert.length - 1]);\n", " 
element.append(toinsert);\n", " return toinsert\n", " }\n", "\n", " /* Handle when an output is cleared or removed */\n", " events.on('clear_output.CodeCell', handleClearOutput);\n", " events.on('delete.Cell', handleClearOutput);\n", "\n", " /* Handle when a new output is added */\n", " events.on('output_added.OutputArea', handleAddOutput);\n", "\n", " /**\n", " * Register the mime type and append_mime function with output_area\n", " */\n", " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", " /* Is output safe? */\n", " safe: true,\n", " /* Index of renderer in `output_area.display_order` */\n", " index: 0\n", " });\n", " }\n", "\n", " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", " if (root.Jupyter !== undefined) {\n", " var events = require('base/js/events');\n", " var OutputArea = require('notebook/js/outputarea').OutputArea;\n", "\n", " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", " register_renderer(events, OutputArea);\n", " }\n", " }\n", "\n", " \n", " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", " root._bokeh_timeout = Date.now() + 5000;\n", " root._bokeh_failed_load = false;\n", " }\n", "\n", " var NB_LOAD_WARNING = {'data': {'text/html':\n", " \"
\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"
\\n\"+\n", "
\n", "| | NPR1 | NPR2 | SASA | asphericity | atom.dprop.EHTCharge | eccentricity | mol | molregno |\n",
"|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 0.344558 | 0.790223 | 474.773633 | 0.295012 | -0.65766151302259046 0.28616869584385629 0.058... | 0.938765 | <rdkit.Chem.rdchem.Mol object at 0x7f2baa1c3170> | 29272 |\n",
"| 1 | 0.174056 | 0.953789 | 312.566799 | 0.570840 | -0.65058437663054747 0.29115551530406947 0.036... | 0.984736 | <rdkit.Chem.rdchem.Mol object at 0x7f2baa1c33f0> | 29758 |\n",
"| 2 | 0.063609 | 0.975599 | 446.033757 | 0.822029 | -0.65604834706359316 0.28231130455311593 0.045... | 0.997975 | <rdkit.Chem.rdchem.Mol object at 0x7f2baa1c35d0> | 29449 |\n",
"| 3 | 0.284320 | 0.803002 | 473.661314 | 0.376431 | -0.70531069101833488 0.30094745736190931 0.077... | 0.958730 | <rdkit.Chem.rdchem.Mol object at 0x7f2baa1c3710> | 29244 |\n",
"| 4 | 0.253028 | 0.828362 | 520.615556 | 0.424006 | -0.70345551365543968 0.29596092056375445 0.061... | 0.967459 | <rdkit.Chem.rdchem.Mol object at 0x7f2baa1c34e0> | 29265 |\n", "
\n", "| | canonical_smiles | molregno | activity_id | standard_value | standard_units | MolLogP | TPSA | NumRotatableBonds | MolWt | NumHAcceptors | NumHDonors | NPR1 | NPR2 | SASA | asphericity | atom.dprop.EHTCharge | eccentricity | mol |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)c... | 29272 | 671631 | 49000.0 | nM | 1.6996 | 92.5 | 5 | 419.469 | 4 | 2 | 0.344558 | 0.790223 | 474.773633 | 0.295012 | -0.65766151302259046 0.28616869584385629 0.058... | 0.938765 | <rdkit.Chem.rdchem.Mol object at 0x7f2ba8e4ee90> |\n",
"| 1 | N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)c... | 29272 | 671631 | 49000.0 | nM | 1.6996 | 92.5 | 5 | 419.469 | 4 | 2 | 0.344558 | 0.790223 | 474.773633 | 0.295012 | -0.65766151302259046 0.28616869584385629 0.058... | 0.938765 | <rdkit.Chem.rdchem.Mol object at 0x7f2ba8555490> |\n",
"| 2 | N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)c... | 29272 | 671631 | 49000.0 | nM | 1.6996 | 92.5 | 5 | 419.469 | 4 | 2 | 0.344558 | 0.790223 | 474.773633 | 0.295012 | -0.65766151302259046 0.28616869584385629 0.058... | 0.938765 | <rdkit.Chem.rdchem.Mol object at 0x7f2ba8e4ee90> |\n",
"| 3 | N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)c... | 29272 | 671631 | 49000.0 | nM | 1.6996 | 92.5 | 5 | 419.469 | 4 | 2 | 0.344558 | 0.790223 | 474.773633 | 0.295012 | -0.65766151302259046 0.28616869584385629 0.058... | 0.938765 | <rdkit.Chem.rdchem.Mol object at 0x7f2ba8555490> |\n",
"| 4 | N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)c... | 29272 | 2606579 | 49000.0 | nM | 1.6996 | 92.5 | 5 | 419.469 | 4 | 2 | 0.344558 | 0.790223 | 474.773633 | 0.295012 | -0.65766151302259046 0.28616869584385629 0.058... | 0.938765 | <rdkit.Chem.rdchem.Mol object at 0x7f2ba8e4ee90> |\n", "