{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "cupy_chembl_example.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "k_ZTVo0BdhJh",
        "colab_type": "text"
      },
      "source": [
        "# CuPy CUDA ChEMBL similarity search example\n",
        "\n",
        "If using NumPy => 1.16 PyTables > 3.44 will be required. Install PyTables 3.51 if you're running this notebook in colab. Any other dependency is already installed in colab's default env.\n",
        "\n",
        "Remember to restart the runtime after upgrading PyTables with pip!!!\n",
        "\n",
        "\n",
        "You will also need to download ChEMBL25 FPSim2 database file.\n",
        "\n",
        "Did you know BTW that we recently updated [FPSim2](https://github.com/chembl/FPSim2) replacing it's Cython core by C++ binded with  [PyBind11](https://github.com/pybind/pybind11) with improved performance and that it's now also compatible with Windows and Python 3.7?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Iye5Wa4jf9MQ",
        "colab_type": "text"
      },
      "source": [
        "Preflight config, run this cell only the first time"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "v5AcWu7RfwPZ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# update PyTables, you'll need to restart the environment in colab after the install!!!\n",
        "# !pip install tables==3.5.1\n",
        "\n",
        "# download ChEMBL25 FPSim2 FP db\n",
        "# !wget \"http://ftp.ebi.ac.uk/pub/databases/chembl/fpsim2/chembl_25.h5\""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rB3HLtMr2LsA",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import cupy as cp\n",
        "import tables as tb\n",
        "import numpy as np"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "THvf0361Bl4v",
        "colab_type": "text"
      },
      "source": [
        "# Load FPSim2 fingerprint database\n",
        "\n",
        "\n",
        "fps variable contains fingerprints (2048 bit hashed Morgan, radius 2) for all ChEMBL25 database molecules. Each row represents a molecule and it's structure is the following:\n",
        "\n",
        "\n",
        "```\n",
        "array([84419, 0,140737488355328, 17592186044416, 1024, 1099549376512, 0, 0, 0, 0, 0, 9007199254741248, 0,16777216, 0, 2305843009213693952, 0, 1073741824, 0, 0, 2199023255552, 0, 0, 0, 0, 0, 0, 32, 0, 0, 34359738372, 0, 0, 15], dtype=uint64)\n",
        "```\n",
        "First array's element is the ChEMBL molregno and last one is the count of ON bits in it's fingerprint (popcount). The 32 values in between are the 2048 fingerprint bits grouped as 64bit unsigned integers.\n",
        "\n",
        "Molecules in FP db are sorted by popcount, which is needed to apply the bounds for sublinear time found in this classic paper: [10.1021/ci600358f](https://doi.org/10.1021/ci600358f)\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wyJdmMupFXl2",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# using same FPsim2 ChEMBL FP database :)\n",
        "fp_filename = \"chembl_25.h5\"\n",
        "with tb.open_file(fp_filename, mode=\"r\") as fp_file:\n",
        "    fps = fp_file.root.fps[:]\n",
        "    num_fields = len(fps[0])\n",
        "    fps = fps.view(\"u8\")\n",
        "    fps = fps.reshape(int(fps.size / num_fields), num_fields)\n",
        "    # we'll use popcnt_ranges for the bounds optimisaiton, it stores \n",
        "    # the ranges for each popcount in the database\n",
        "    popcnt_ranges = fp_file.root.config[3]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nnjGuUnGFs41",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# aspirin, ChEMBL molregno 1280\n",
        "query_molregno = 1280"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GTXWEq77Evb1",
        "colab_type": "text"
      },
      "source": [
        "# Let's try the ElementWise kernel\n",
        "\n",
        "CuPy's [ElementWise](https://docs-cupy.chainer.org/en/stable/tutorial/kernel.html#basics-of-elementwise-kernels) kernel will apply the same operation for each row. This makes sense to us because we would like to calc similarity for all molecules in the FP db file for given a query molecule .\n",
        "\n",
        "You probably noticed that we are using **i** variable which is not declared in the code... this is a special variable that indicates the index within the loop\n",
        "\n",
        "__popcll is a GPU instruction similar to the ones found in CPU which efficiently counts the number of 1's in a bit array"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5qpH8XKIEuO-",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "taniEW = cp.ElementwiseKernel(\n",
        "    in_params=\"raw T db, raw U query, uint64 in_width, float32 threshold\",\n",
        "    out_params=\"raw V out\",\n",
        "    operation=r\"\"\"\n",
        "    int comm_sum = 0;\n",
        "    for(int j = 1; j < in_width - 1; ++j){\n",
        "        int pos = i * in_width + j;\n",
        "        comm_sum += __popcll(db[pos] & query[j]);\n",
        "    }\n",
        "    float coeff = 0.0;\n",
        "    coeff = query[in_width - 1] + db[i * in_width + in_width - 1] - comm_sum;\n",
        "    if (coeff != 0.0)\n",
        "        coeff = comm_sum / coeff;\n",
        "    out[i] = coeff >= threshold ? coeff : 0.0;\n",
        "    \"\"\",\n",
        "    name='taniEW',\n",
        "    options=('-std=c++14',),\n",
        "    reduce_dims=False\n",
        ")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Nh1lehEWFPuV",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# get the query molecule from the FP database\n",
        "query = cp.asarray(fps[(fps[:,0] == query_molregno)][0])\n",
        "# copy the database to GPU\n",
        "database = cp.asarray(fps)\n",
        "\n",
        "def cupy_elementwise_search(db, query, threshold):\n",
        "    # init the results variable  \n",
        "    sim = cp.zeros(database.shape[0], dtype=\"f4\")\n",
        "    \n",
        "    # set the threshold variable and run the search\n",
        "    threshold = cp.asarray(threshold, dtype=\"f4\")\n",
        "    taniEW(db, query, db.shape[1], threshold, sim, size=db.shape[0])\n",
        "\n",
        "    mask = sim.nonzero()[0]\n",
        "    np_sim = cp.asnumpy(sim[mask])\n",
        "    np_ids = cp.asnumpy(db[:,0][mask])\n",
        "    \n",
        "    dtype = np.dtype([(\"mol_id\", \"u4\"), (\"coeff\", \"f4\")])\n",
        "    results = np.empty(len(np_ids), dtype=dtype)\n",
        "    results[\"mol_id\"] = np_ids\n",
        "    results[\"coeff\"] = np_sim\n",
        "    results[::-1].sort(order='coeff')\n",
        "    return results"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "gFP4HKROF5l_",
        "colab_type": "code",
        "outputId": "74d4ddb5-6ca5-461b-d311-5134110b7405",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 69
        }
      },
      "source": [
        "results = cupy_elementwise_search(database, query, 0.7)\n",
        "results"
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([(   1280, 1.        ), (2096455, 0.8888889 ),\n",
              "       ( 271022, 0.85714287), ( 875057, 0.7       )],\n",
              "      dtype=[('mol_id', '<u4'), ('coeff', '<f4')])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 7
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "D-fAwQZwckS5",
        "colab_type": "text"
      },
      "source": [
        "We got aspirin back :) let's time it"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "RdWrupurF-w0",
        "colab_type": "code",
        "outputId": "82590e4e-7306-441d-d6c2-af463cbfb797",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "%%timeit\n",
        "results = cupy_elementwise_search(database, query, 0.7)\n"
      ],
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "10 loops, best of 3: 32.4 ms per loop\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7RBUzGivKbI0",
        "colab_type": "text"
      },
      "source": [
        "# Not bad for a brute force approach! But let's also try with the bounds optimisation"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kl_9szudXYys",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def get_tanimoto_bounds(qpopcnt, ranges, threshold):\n",
        "    range_to_screen = []\n",
        "    for count, c_range in ranges:\n",
        "        max_sim = min(qpopcnt, count) / max(qpopcnt, count)\n",
        "        if max_sim >= threshold:\n",
        "            range_to_screen.append(c_range)\n",
        "    if range_to_screen:\n",
        "        range_to_screen = (range_to_screen[0][0], \n",
        "                           range_to_screen[len(range_to_screen) - 1][1])\n",
        "    return range_to_screen"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0ARyUwHQKaUH",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def cupy_elementwise_search_bounds(db, query, popcnt_ranges, threshold):\n",
        "  \n",
        "    # get the range of molecules to screen\n",
        "    rk = get_tanimoto_bounds(int(query[-1]), popcnt_ranges, threshold)\n",
        "    \n",
        "    # set the threshold variable\n",
        "    threshold = cp.asarray(threshold, dtype=\"f4\")\n",
        "\n",
        "    # get the subset of molecule ids\n",
        "    ids = db[:,0][slice(*rk)]\n",
        "    subset_size = int(rk[1]-rk[0])\n",
        "\n",
        "    # init the results variable\n",
        "    sim = cp.zeros(subset_size, dtype=cp.float32)\n",
        "\n",
        "    # run the search. It will compile the kernel only the first time it runs\n",
        "    taniEW(db[slice(*rk)], query, db.shape[1], threshold, sim, size=subset_size)\n",
        "\n",
        "    # get all non 0 values and ids\n",
        "    mask = sim.nonzero()[0]\n",
        "    np_sim = cp.asnumpy(sim[mask])\n",
        "    np_ids = cp.asnumpy(ids[mask])\n",
        "\n",
        "    # create results numpy array\n",
        "    dtype = np.dtype([(\"mol_id\", \"u4\"), (\"coeff\", \"f4\")])\n",
        "    results = np.empty(len(np_ids), dtype=dtype)\n",
        "    results[\"mol_id\"] = np_ids\n",
        "    results[\"coeff\"] = np_sim\n",
        "    results[::-1].sort(order='coeff')\n",
        "    return results"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "bS5tt0voW-Lt",
        "colab_type": "code",
        "outputId": "e8511868-5502-4b74-f2ca-533631856a08",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 69
        }
      },
      "source": [
        "results = cupy_elementwise_search_bounds(database, query, popcnt_ranges, 0.7)\n",
        "results"
      ],
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([(   1280, 1.        ), (2096455, 0.8888889 ),\n",
              "       ( 271022, 0.85714287), ( 875057, 0.7       )],\n",
              "      dtype=[('mol_id', '<u4'), ('coeff', '<f4')])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 11
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8kf4pE27XIZl",
        "colab_type": "code",
        "outputId": "27d7ae57-7088-420c-960f-c7add189fbdf",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "%%timeit\n",
        "results = cupy_elementwise_search_bounds(database, query, popcnt_ranges, 0.7)\n"
      ],
      "execution_count": 12,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "100 loops, best of 3: 3.47 ms per loop\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "m6dQcxLqYRYC",
        "colab_type": "text"
      },
      "source": [
        "# That was quite good! But... can we speed it up using a [RawKernel](https://docs-cupy.chainer.org/en/stable/tutorial/kernel.html#raw-kernels)?\n",
        "\n",
        "We'll be directly running a bounds optimised version\n",
        "\n",
        "I would recommend to read at least two first chapters of [CUDA C Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) in order to better understand what's going on here.\n",
        "\n",
        "In this example we are setting a Grid of Thread Blocks with the size of the database. Each Thread Block will have a size of 32 which is the number of uint64 fields storing the 2048 bit fingerprint.\n",
        "\n",
        "Each Thread Block is calculating the popcount of the intersection between each uint64 field of the query and the db molecule present in the block in parallel. When all threads finish, in a syncronized way, thread 0 in each block sums the popcounts and calculates the similarity.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9U_I6noWZWYz",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# load database as cupy arrays\n",
        "database = cp.asarray(fps[:,1:-1])\n",
        "ids = cp.asarray(fps[:,0])\n",
        "popcnts = cp.asarray(fps[:,-1])\n",
        "\n",
        "# quering with aspirin (molregno 1280) as usual :P\n",
        "query = fps[(fps[:,0] == query_molregno)]\n",
        "\n",
        "\n",
        "taniRAW = cp.RawKernel(\n",
        "    r\"\"\"\n",
        "    extern \"C\" __global__\n",
        "    void taniRAW(const unsigned long long int* query, \n",
        "                 const unsigned long long int* qcount, \n",
        "                 const unsigned long long int* db,\n",
        "                 const unsigned long long int* popcnts,\n",
        "                 float* threshold,\n",
        "                 float* out) {{\n",
        "\n",
        "        // Shared block array. Only visible for threads in same block\n",
        "        __shared__ int common[{block}];\n",
        "\n",
        "        int tid = blockDim.x * blockIdx.x + threadIdx.x;\n",
        "        common[threadIdx.x] = __popcll(query[threadIdx.x] & db[tid]);\n",
        "\n",
        "        // threads need to wait until all threads finish\n",
        "        __syncthreads();\n",
        "\n",
        "        // thread 0 in each block sums the common bits \n",
        "        // and calcs the final coeff\n",
        "        if(0 == threadIdx.x) \n",
        "        {{\n",
        "            int comm_sum = 0;\n",
        "            for(int i=0; i<{block}; i++)\n",
        "                comm_sum += common[i];\n",
        "\n",
        "            float coeff = 0.0;\n",
        "            coeff = *qcount + popcnts[blockIdx.x] - comm_sum;\n",
        "            if (coeff != 0.0)\n",
        "                coeff = comm_sum / coeff;\n",
        "            out[blockIdx.x] = coeff >= *threshold ? coeff : 0.0;\n",
        "        }}\n",
        "    }}\n",
        "    \"\"\".format(block=database.shape[1]),\n",
        "    name=\"taniRAW\",\n",
        "    options=('-std=c++14',),\n",
        ")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "fINESKQAZWo8",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def cupy_sim_search_bounds(db, db_popcnts, db_ids, query, popcnt_ranges, threshold):\n",
        "  \n",
        "    c_query = cp.asarray(query[:,1:-1])\n",
        "    qpopcnt = cp.asarray(query[:,-1])\n",
        "\n",
        "    # get the range of the molecule subset to screen\n",
        "    rk = get_tanimoto_bounds(int(query[:,-1]), popcnt_ranges, threshold)\n",
        "    \n",
        "    threshold = cp.asarray(threshold, dtype=\"f4\")\n",
        "\n",
        "    # get the subset of molecule ids\n",
        "    subset_size = int(rk[1]-rk[0])\n",
        "    ids2 = db_ids[slice(*rk)]\n",
        "\n",
        "    # init results array\n",
        "    sim = cp.zeros(subset_size, dtype=cp.float32)\n",
        "\n",
        "    # run the search, it compiles the kernel only the first time it runs\n",
        "    # grid, block and arguments\n",
        "    taniRAW((subset_size,), \n",
        "            (db.shape[1],), \n",
        "            (c_query, qpopcnt, db[slice(*rk)], db_popcnts[slice(*rk)], threshold, sim))\n",
        "\n",
        "    # get all non 0 values and ids\n",
        "    mask = sim.nonzero()[0]\n",
        "    np_sim = cp.asnumpy(sim[mask])\n",
        "    np_ids = cp.asnumpy(ids2[mask])\n",
        "\n",
        "    # create results numpy array\n",
        "    dtype = np.dtype([(\"mol_id\", \"u4\"), (\"coeff\", \"f4\")])\n",
        "    results = np.empty(len(np_ids), dtype=dtype)\n",
        "    results[\"mol_id\"] = np_ids\n",
        "    results[\"coeff\"] = np_sim\n",
        "    results[::-1].sort(order='coeff')\n",
        "    return results"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Jkx39KZKZW4a",
        "colab_type": "code",
        "outputId": "5b062cd2-60b9-4c02-d9cf-6b6cb193be70",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 69
        }
      },
      "source": [
        "results = cupy_sim_search_bounds(database, popcnts, ids, query, popcnt_ranges, 0.7)\n",
        "results"
      ],
      "execution_count": 15,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([(   1280, 1.        ), (2096455, 0.8888889 ),\n",
              "       ( 271022, 0.85714287), ( 875057, 0.7       )],\n",
              "      dtype=[('mol_id', '<u4'), ('coeff', '<f4')])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 15
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AParjHT9ZXGe",
        "colab_type": "code",
        "outputId": "4616b822-fb76-4f74-db50-bd0c71fb5b8e",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "%%timeit\n",
        "results = cupy_sim_search_bounds(database, popcnts, ids, query, popcnt_ranges, 0.7)\n"
      ],
      "execution_count": 16,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "100 loops, best of 3: 2.64 ms per loop\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jFhK6bMCj56e",
        "colab_type": "text"
      },
      "source": [
        "RawKernel implementation looks like faster! :)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Fe6G0Bdl-NJT",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}