{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "cupy_chembl_example.ipynb", "version": "0.3.2", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "k_ZTVo0BdhJh", "colab_type": "text" }, "source": [ "# CuPy CUDA ChEMBL similarity search example\n", "\n", "If using NumPy => 1.16 PyTables > 3.44 will be required. Install PyTables 3.51 if you're running this notebook in colab. Any other dependency is already installed in colab's default env.\n", "\n", "Remember to restart the runtime after upgrading PyTables with pip!!!\n", "\n", "\n", "You will also need to download ChEMBL25 FPSim2 database file.\n", "\n", "Did you know BTW that we recently updated [FPSim2](https://github.com/chembl/FPSim2) replacing it's Cython core by C++ binded with [PyBind11](https://github.com/pybind/pybind11) with improved performance and that it's now also compatible with Windows and Python 3.7?" ] }, { "cell_type": "markdown", "metadata": { "id": "Iye5Wa4jf9MQ", "colab_type": "text" }, "source": [ "Preflight config, run this cell only the first time" ] }, { "cell_type": "code", "metadata": { "id": "v5AcWu7RfwPZ", "colab_type": "code", "colab": {} }, "source": [ "# update PyTables, you'll need to restart the environment in colab after the install!!!\n", "# !pip install tables==3.5.1\n", "\n", "# download ChEMBL25 FPSim2 FP db\n", "# !wget \"http://ftp.ebi.ac.uk/pub/databases/chembl/fpsim2/chembl_25.h5\"" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "rB3HLtMr2LsA", "colab_type": "code", "colab": {} }, "source": [ "import cupy as cp\n", "import tables as tb\n", "import numpy as np" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "THvf0361Bl4v", "colab_type": "text" }, "source": [ "# Load FPSim2 fingerprint database\n", "\n", "\n", "fps variable contains fingerprints (2048 bit hashed Morgan, radius 2) for all ChEMBL25 database molecules. Each row represents a molecule and it's structure is the following:\n", "\n", "\n", "```\n", "array([84419, 0,140737488355328, 17592186044416, 1024, 1099549376512, 0, 0, 0, 0, 0, 9007199254741248, 0,16777216, 0, 2305843009213693952, 0, 1073741824, 0, 0, 2199023255552, 0, 0, 0, 0, 0, 0, 32, 0, 0, 34359738372, 0, 0, 15], dtype=uint64)\n", "```\n", "First array's element is the ChEMBL molregno and last one is the count of ON bits in it's fingerprint (popcount). The 32 values in between are the 2048 fingerprint bits grouped as 64bit unsigned integers.\n", "\n", "Molecules in FP db are sorted by popcount, which is needed to apply the bounds for sublinear time found in this classic paper: [10.1021/ci600358f](https://doi.org/10.1021/ci600358f)\n" ] }, { "cell_type": "code", "metadata": { "id": "wyJdmMupFXl2", "colab_type": "code", "colab": {} }, "source": [ "# using same FPsim2 ChEMBL FP database :)\n", "fp_filename = \"chembl_25.h5\"\n", "with tb.open_file(fp_filename, mode=\"r\") as fp_file:\n", " fps = fp_file.root.fps[:]\n", " num_fields = len(fps[0])\n", " fps = fps.view(\"u8\")\n", " fps = fps.reshape(int(fps.size / num_fields), num_fields)\n", " # we'll use popcnt_ranges for the bounds optimisaiton, it stores \n", " # the ranges for each popcount in the database\n", " popcnt_ranges = fp_file.root.config[3]" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "nnjGuUnGFs41", "colab_type": "code", "colab": {} }, "source": [ "# aspirin, ChEMBL molregno 1280\n", "query_molregno = 1280" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "GTXWEq77Evb1", "colab_type": "text" }, "source": [ "# Let's try the ElementWise kernel\n", "\n", "CuPy's [ElementWise](https://docs-cupy.chainer.org/en/stable/tutorial/kernel.html#basics-of-elementwise-kernels) kernel will apply the same operation for each row. This makes sense to us because we would like to calc similarity for all molecules in the FP db file for given a query molecule .\n", "\n", "You probably noticed that we are using **i** variable which is not declared in the code... this is a special variable that indicates the index within the loop\n", "\n", "__popcll is a GPU instruction similar to the ones found in CPU which efficiently counts the number of 1's in a bit array" ] }, { "cell_type": "code", "metadata": { "id": "5qpH8XKIEuO-", "colab_type": "code", "colab": {} }, "source": [ "taniEW = cp.ElementwiseKernel(\n", " in_params=\"raw T db, raw U query, uint64 in_width, float32 threshold\",\n", " out_params=\"raw V out\",\n", " operation=r\"\"\"\n", " int comm_sum = 0;\n", " for(int j = 1; j < in_width - 1; ++j){\n", " int pos = i * in_width + j;\n", " comm_sum += __popcll(db[pos] & query[j]);\n", " }\n", " float coeff = 0.0;\n", " coeff = query[in_width - 1] + db[i * in_width + in_width - 1] - comm_sum;\n", " if (coeff != 0.0)\n", " coeff = comm_sum / coeff;\n", " out[i] = coeff >= threshold ? coeff : 0.0;\n", " \"\"\",\n", " name='taniEW',\n", " options=('-std=c++14',),\n", " reduce_dims=False\n", ")" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Nh1lehEWFPuV", "colab_type": "code", "colab": {} }, "source": [ "# get the query molecule from the FP database\n", "query = cp.asarray(fps[(fps[:,0] == query_molregno)][0])\n", "# copy the database to GPU\n", "database = cp.asarray(fps)\n", "\n", "def cupy_elementwise_search(db, query, threshold):\n", " # init the results variable \n", " sim = cp.zeros(database.shape[0], dtype=\"f4\")\n", " \n", " # set the threshold variable and run the search\n", " threshold = cp.asarray(threshold, dtype=\"f4\")\n", " taniEW(db, query, db.shape[1], threshold, sim, size=db.shape[0])\n", "\n", " mask = sim.nonzero()[0]\n", " np_sim = cp.asnumpy(sim[mask])\n", " np_ids = cp.asnumpy(db[:,0][mask])\n", " \n", " dtype = np.dtype([(\"mol_id\", \"u4\"), (\"coeff\", \"f4\")])\n", " results = np.empty(len(np_ids), dtype=dtype)\n", " results[\"mol_id\"] = np_ids\n", " results[\"coeff\"] = np_sim\n", " results[::-1].sort(order='coeff')\n", " return results" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "gFP4HKROF5l_", "colab_type": "code", "outputId": "74d4ddb5-6ca5-461b-d311-5134110b7405", "colab": { "base_uri": "https://localhost:8080/", "height": 69 } }, "source": [ "results = cupy_elementwise_search(database, query, 0.7)\n", "results" ], "execution_count": 7, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([( 1280, 1. ), (2096455, 0.8888889 ),\n", " ( 271022, 0.85714287), ( 875057, 0.7 )],\n", " dtype=[('mol_id', '= threshold:\n", " range_to_screen.append(c_range)\n", " if range_to_screen:\n", " range_to_screen = (range_to_screen[0][0], \n", " range_to_screen[len(range_to_screen) - 1][1])\n", " return range_to_screen" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "0ARyUwHQKaUH", "colab_type": "code", "colab": {} }, "source": [ "def cupy_elementwise_search_bounds(db, query, popcnt_ranges, threshold):\n", " \n", " # get the range of molecules to screen\n", " rk = get_tanimoto_bounds(int(query[-1]), popcnt_ranges, threshold)\n", " \n", " # set the threshold variable\n", " threshold = cp.asarray(threshold, dtype=\"f4\")\n", "\n", " # get the subset of molecule ids\n", " ids = db[:,0][slice(*rk)]\n", " subset_size = int(rk[1]-rk[0])\n", "\n", " # init the results variable\n", " sim = cp.zeros(subset_size, dtype=cp.float32)\n", "\n", " # run the search. It will compile the kernel only the first time it runs\n", " taniEW(db[slice(*rk)], query, db.shape[1], threshold, sim, size=subset_size)\n", "\n", " # get all non 0 values and ids\n", " mask = sim.nonzero()[0]\n", " np_sim = cp.asnumpy(sim[mask])\n", " np_ids = cp.asnumpy(ids[mask])\n", "\n", " # create results numpy array\n", " dtype = np.dtype([(\"mol_id\", \"u4\"), (\"coeff\", \"f4\")])\n", " results = np.empty(len(np_ids), dtype=dtype)\n", " results[\"mol_id\"] = np_ids\n", " results[\"coeff\"] = np_sim\n", " results[::-1].sort(order='coeff')\n", " return results" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "bS5tt0voW-Lt", "colab_type": "code", "outputId": "e8511868-5502-4b74-f2ca-533631856a08", "colab": { "base_uri": "https://localhost:8080/", "height": 69 } }, "source": [ "results = cupy_elementwise_search_bounds(database, query, popcnt_ranges, 0.7)\n", "results" ], "execution_count": 11, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([( 1280, 1. ), (2096455, 0.8888889 ),\n", " ( 271022, 0.85714287), ( 875057, 0.7 )],\n", " dtype=[('mol_id', '= *threshold ? coeff : 0.0;\n", " }}\n", " }}\n", " \"\"\".format(block=database.shape[1]),\n", " name=\"taniRAW\",\n", " options=('-std=c++14',),\n", ")" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "fINESKQAZWo8", "colab_type": "code", "colab": {} }, "source": [ "def cupy_sim_search_bounds(db, db_popcnts, db_ids, query, popcnt_ranges, threshold):\n", " \n", " c_query = cp.asarray(query[:,1:-1])\n", " qpopcnt = cp.asarray(query[:,-1])\n", "\n", " # get the range of the molecule subset to screen\n", " rk = get_tanimoto_bounds(int(query[:,-1]), popcnt_ranges, threshold)\n", " \n", " threshold = cp.asarray(threshold, dtype=\"f4\")\n", "\n", " # get the subset of molecule ids\n", " subset_size = int(rk[1]-rk[0])\n", " ids2 = db_ids[slice(*rk)]\n", "\n", " # init results array\n", " sim = cp.zeros(subset_size, dtype=cp.float32)\n", "\n", " # run the search, it compiles the kernel only the first time it runs\n", " # grid, block and arguments\n", " taniRAW((subset_size,), \n", " (db.shape[1],), \n", " (c_query, qpopcnt, db[slice(*rk)], db_popcnts[slice(*rk)], threshold, sim))\n", "\n", " # get all non 0 values and ids\n", " mask = sim.nonzero()[0]\n", " np_sim = cp.asnumpy(sim[mask])\n", " np_ids = cp.asnumpy(ids2[mask])\n", "\n", " # create results numpy array\n", " dtype = np.dtype([(\"mol_id\", \"u4\"), (\"coeff\", \"f4\")])\n", " results = np.empty(len(np_ids), dtype=dtype)\n", " results[\"mol_id\"] = np_ids\n", " results[\"coeff\"] = np_sim\n", " results[::-1].sort(order='coeff')\n", " return results" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Jkx39KZKZW4a", "colab_type": "code", "outputId": "5b062cd2-60b9-4c02-d9cf-6b6cb193be70", "colab": { "base_uri": "https://localhost:8080/", "height": 69 } }, "source": [ "results = cupy_sim_search_bounds(database, popcnts, ids, query, popcnt_ranges, 0.7)\n", "results" ], "execution_count": 15, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([( 1280, 1. ), (2096455, 0.8888889 ),\n", " ( 271022, 0.85714287), ( 875057, 0.7 )],\n", " dtype=[('mol_id', '