{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Shape of Molecules #\n",
    "\n",
    "In this notebook we provide an innovative pipeline that makes it possible to find interesting and meaningful structural features for chiemical compounds by exploiting the package $\\href{https://giotto.ai/}{giotto learn}$. The task of this notebook is to classify chemical compunds as HIV inhibitors or non-inhibitors for the HIV virus. The problem is a benchmark for molecules graph representation as stated in this  $\\href{https://pubs.rsc.org/en/content/articlehtml/2018/sc/c7sc02664a}{paper}$. The novel idea of this notebook is that of exploiting heat diffusion defined over any-order graph cliques (here nodes and edges) to embed the entire graph. In order to defined such diffusion processes we use the definition of higher-order laplcians matrices of clique complexes (special case of simplicial complexes).\n",
    "\n",
    "### Example of heat diffusion over nodes sampled at two different points ###\n",
    "\n",
    "<tr>\n",
    "    <td> <img src=\"images/node_diff_0.png\" align=\"left\" alt=\"Drawing\" style=\"width:50%;\"/> </td>\n",
    "    <td> <img src=\"images/node_diff_20.png\" alt=\"Drawing\" style=\"width:50%;\"/> </td>\n",
    "</tr>\n",
    "\n",
    "### Example of heat diffusion over edges  sampled at two different points ###\n",
    "\n",
    "<tr>\n",
    "    <td> <img src=\"images/edge_diff_0.png\" align=\"left\" alt=\"Drawing\" style=\"width:50%;\"/> </td>\n",
    "    <td> <img src=\"images/edge_diff_20.png\" alt=\"Drawing\" style=\"width:50%;\"/> </td>\n",
    "</tr>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Import statements\n",
    "import warnings; warnings.simplefilter('ignore')\n",
    "import random \n",
    "\n",
    "\n",
    "import numpy as np\n",
    "import networkx as nx\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from sklearn.cluster import KMeans\n",
    "\n",
    "from giotto.graphs.create_clique_complex import CreateCliqueComplex, CreateBoundaryMatrices, CreateLaplacianMatrices    \n",
    "from giotto.graphs.heat_diffusion import HeatDiffusion\n",
    "from giotto.graphs.graph_entropy import GraphEntropy\n",
    "\n",
    "from molecules import mol_to_nx, compute_node_edge_entropy, bonds_type, graph_to_points, bonds_type_to_edge\n",
    "from plotting import plot_entropies, plot_network_diffusion\n",
    "\n",
    "from rdkit import Chem \n",
    "from rdkit.Chem import Draw\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import and Convert data to networkx Graph ##"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import molecules dataset and convert them: $\\textit{smiles}$ --> $\\textit{rdkit.Chem.rdchem.Mol}$ --> $\\textit{networkx.graph}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Import and convert data\n",
    "df = pd.read_csv('hiv.csv')\n",
    "df['g_mol'] = df['smiles'].apply(lambda x: Chem.MolFromSmiles(x))\n",
    "df.drop(\"smiles\", axis=1, inplace=True)\n",
    "g_mol = [mol_to_nx(df['g_mol'][i]) for i in range(df.shape[0]) if i != 559 and i!= 8097 ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Embeddings for all atoms and bonds in the dataset ##\n",
    "\n",
    "Here the embeddings for all atoms and bonds in the dataset are computed. The idea at the core of the procedure is that to embed a specific node (and so atom) $a_i$ , we study the heat diffusion process that has as initial condition a delta function with 1 on node $a_i$  and 0 otherwise. In order to characterize the diffusion process we sample it at different points in time, generating different snapshots. The values of the diffusion process at each snapshot is then treated as a probability distribution over the nodes of the graph (and so atoms of the molecule) and its entropy is computed. At the end, the embedding vector is populated with the entropy values computed over different points in time. In this step it is possible to tune two different hyperparameters: the number of points in time at which the heat diffusion has to be sampled and the last point instant. In this example the different samples are linearly spaced in time but of course it is possible to choose them differently. By exploiting the higher-order laplacians for clique complexes and the giotto-learn package it is possible to apply the same procedure to the edges of the graph and, in general, to higher order cliques. We then compute the same representation for each edge of all molecules, which represent bonds between two atoms. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hyperparameters\n",
    "taus_n = np.linspace(0,2,20)\n",
    "taus_e = np.linspace(0,2,20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here the embeddings are computed. It is possible to save time and load them directly from $\\textit{openML}$ in the subsequent cell. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Embedding\n",
    "embeds = [compute_node_edge_entropy(x,i, taus_n, taus_e) for i,x in enumerate(g_mol) ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create list with node and edge embeddings\n",
    "universal_nodes = list()\n",
    "for i in range(len(embeds)):\n",
    "    universal_nodes.extend(np.split(embeds[i][0][:,:].T, embeds[i][0][0,:].shape[0]))\n",
    "universal_nodes = np.squeeze(np.array(universal_nodes))\n",
    "\n",
    "universal_edges = list()\n",
    "for i in range(len(embeds)):\n",
    "    universal_edges.extend(np.split(embeds[i][1][:,:].T, embeds[i][1][0,:].shape[0]))\n",
    "universal_edges = np.squeeze(np.array(universal_edges))\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the embeddings have been found, this code can store them in .arff, the format adopted by $\\textit{openML}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the new node embeddings as arff\n",
    "import arff\n",
    "\n",
    "dim_n = 30\n",
    "attributes_n = [('n{}'.format(e), 'REAL') for e in range(dim_n)]\n",
    "    \n",
    "node_dict = {\n",
    "    'relation': 'Node_embedding_hiv',\n",
    "    'description': 'This dataset contains the embedding for all nodes of moelcules in HIV inhibitors dataset',\n",
    "    'attributes': attributes_n,\n",
    "    'data': universal_nodes.tolist()\n",
    "}\n",
    "\n",
    "_= arff.dump(node_dict, open(\"node_embedding_hiv.arff\", \"w+\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the new edge embeddings as arff\n",
    "import arff\n",
    "\n",
    "dim_e = 30\n",
    "attributes_e = [('e{}'.format(e), 'REAL') for e in range(dim_e)]\n",
    "    \n",
    "edge_dict = {\n",
    "    'relation': 'Edge_embedding_hiv',\n",
    "    'description': 'This dataset contains the embedding for all edges of moelcules in HIV inhibitors dataset',\n",
    "    'attributes': attributes_e,\n",
    "    'data': universal_edges.tolist()\n",
    "}\n",
    "\n",
    "_ = arff.dump(edge_dict, open(\"edge_embedding_hiv.arff\", \"w+\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Pre-Generated Embeddings ##\n",
    "\n",
    "Here the previously found embeddings can be loaded from the $\\textit{OpenML}$ site."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import arff\n",
    "import openml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load nodes (atoms) embedding\n",
    "data_n = openml.datasets.get_dataset(42218)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load edges (bonds) embedding\n",
    "data_e = openml.datasets.get_dataset(42220)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "universal_nodes = np.array(data_n.get_data()[0])\n",
    "universal_edges = np.array(data_e.get_data()[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Check shape atoms dataset: {}\".format(universal_nodes.shape))\n",
    "print(\"Check shape edges dataset: {}\".format(universal_edges.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plot Entropies Example ##"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see now one example of heat diffuion's entropy profiles on nodes: first of all several entropy profiles are plotted for different nodes (0-cliques). We then plot the molecule graph in which each node has its label attached. It is possible to observe the local graphical structure and the related diffusion etropy profile. We can see how the heat diffusion entropy depends on the position and so on the role one node has within the network. For example, node 4 is a hub for the molecule and it can easily diffuse heat immediately. This effect is captured by the fact that entropy blows up in the first time samples. On the other hand, node 0 is almost isolated and again the slow spreading effect is captured by the almost flat entropy curve. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Double check that can be useful later to embed entire molecules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mol_to_atom = graph_to_points(g_mol, 0)\n",
    "mol_to_bonds = graph_to_points(g_mol, 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Node Analysis\n",
    "plt.figure(figsize=(20,6))\n",
    "\n",
    "plt.subplot(1, 2, 1)\n",
    "mol_id = 0\n",
    "node_ids = [0, 4, 7, 15]\n",
    "mol = g_mol[mol_id]\n",
    "\n",
    "entropies = [ universal_nodes[x] for x in mol_to_atom[mol_id] ]\n",
    "plot_entropies( entropies, node_ids)\n",
    "\n",
    "# Node Plotting\n",
    "plt.subplot(1, 2, 2)\n",
    "plot_network_diffusion(mol, pos=nx.spring_layout(mol, iterations=1000), node_labels=True)\n",
    "_ = plt.title('Molecule as networkx object')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Adding Bonds information #\n",
    "\n",
    "We now add one piece of chemical information to the problem. The first two methods attach to each molecule graph the information about the type of chemical bonds. In particular, each edge is equipped with a one-hot-encoded vector representation with the number 1 in the position of the corresponding bond type: \n",
    "\n",
    "$0 --> single$\n",
    "\n",
    "$1 --> double$\n",
    "\n",
    "$2 --> triple$\n",
    "\n",
    "$3 --> aromatic$\n",
    "\n",
    "For nodes the vector is obtained by summing up all the one-hot-encoded vectors representing the incidence edges."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bonds_type(g_mol)\n",
    "bonds_type_to_edge(g_mol)\n",
    "\n",
    "#Create a list with all nodes \n",
    "freq_type_bonds = list()\n",
    "for g in g_mol:\n",
    "    freq_type_bonds.extend(list(nx.get_node_attributes(g, 'bonds_one_hot').values()))\n",
    "\n",
    "# Create a list with all edges\n",
    "freq_type_bonds_edge = list()\n",
    "for g in g_mol:\n",
    "    freq_type_bonds_edge.extend(list(nx.get_edge_attributes(g, 'bonds_one_hot').values()))\n",
    "\n",
    "# Check how many atoms and bonds \n",
    "freq_type_bonds = np.array(freq_type_bonds)\n",
    "freq_type_bonds_edge = np.array(freq_type_bonds_edge)\n",
    "print(\"Total number of atoms in the dataset: {}\".format(freq_type_bonds.shape[0]))\n",
    "print(\"Total number of bonds in the dataset: {}\".format(freq_type_bonds_edge.shape[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Universal Node + Bond Types Embeddding #"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We split the molecule embedding list and create one big list with one extended entry per node contaning both the topological features taken from the entropy embedding and the chemical features related to the bond types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "uni_frq_nodes = [np.hstack([universal_nodes[x,:], freq_type_bonds[x,:]]) for x in range(universal_nodes.shape[0])]\n",
    "uni_frq_nodes = np.array(uni_frq_nodes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now cluster all molecules into $\\textit{n_clusters}$ different classes which is another hyperparamter of the pipeline. Moreover, centroids for all classes are stored. They'll be used later to generate the final embedding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Kmeans clustering\n",
    "n_clusters=10\n",
    "kmeans_n = KMeans(n_clusters)\n",
    "universal_class_nodes = kmeans_n.fit_transform(uni_frq_nodes)\n",
    " \n",
    "centroids_n = kmeans_n.cluster_centers_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now popoulate, for each atom, a vector which contains the probability for that atom of belonging to the different $\\textit{n_clusters}$ classes. Once this is obtained, we generate the embedding for a molecule by taking the element-wise sum of the embeddings coming from all its atoms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Soft Encoded\n",
    "soft_encoded_node = [[ np.exp( - (np.linalg.norm(uni_frq_nodes[x]- centroids_n[c], 2) ** 2) / 2) for c in range(n_clusters)] for x in range(uni_frq_nodes.shape[0])]\n",
    "soft_encoded_node = np.array(soft_encoded_node)\n",
    "\n",
    "# Create node data for each graph\n",
    "x_data_node = [ np.sum([soft_encoded_node[n] for n in mol_to_atom[i]], axis=0) for i in range(len(g_mol))]\n",
    "x_data_node = np.array(x_data_node)\n",
    "print(\"Check shape of x_data_node: {}\".format(x_data_node.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Universal Edge + Bond Types Embeddding  #"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exatcly the same procedure is applied to the edges of the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "uni_frq_edges = [np.hstack([universal_edges[x,:], freq_type_bonds_edge[x,:]]) for x in range(universal_edges.shape[0])]\n",
    "uni_frq_edges = np.array(uni_frq_edges)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Kmeans clustering\n",
    "e_clusters = 10\n",
    "kmeans_e = KMeans(e_clusters)\n",
    "universal_class_edge = kmeans_e.fit_transform(uni_frq_edges)\n",
    "\n",
    "centroids_e = kmeans_e.cluster_centers_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Soft Encoded\n",
    "soft_encoded_edge = [[ np.exp( - (np.linalg.norm(uni_frq_edges[x]- centroids_e[c], 2) ** 2) / 2) for c in range(e_clusters)] for x in range(universal_edges.shape[0])]\n",
    "soft_encoded_edge = np.array(soft_encoded_edge)\n",
    "\n",
    "# Create edge data for each graph\n",
    "x_data_edge = [ np.sum([soft_encoded_edge[n] for n in mol_to_bonds[i]], axis=0)  for i in range(len(g_mol))]\n",
    "x_data_edge = np.array(x_data_edge)\n",
    "print(\"Check shape of x_data_edge: {}\".format(x_data_edge.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classification Model and Evaluation#"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This part of the notebook contains the classificator model and training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare x_data\n",
    "x_data = np.hstack([x_data_node, x_data_edge])\n",
    "x_data -= np.mean(x_data, axis=0)\n",
    "x_data /= (np.max(x_data, axis=0) - np.min(x_data, axis=0))\n",
    "\n",
    "print(\"Check shape of x_data: {}\".format(x_data.shape))\n",
    "\n",
    "#Prepare y_data\n",
    "y_data = [df['HIV_active'][i] for i in range(df.shape[0]) if i != 8079 and i != 559]\n",
    "y_data = np.array(y_data)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "f = np.arange(41911)\n",
    "# Change random seed for different experiments\n",
    "random.Random(10).shuffle(f)\n",
    "train = f[:36000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We split data into training and test sets. The model has been previously chosen by adopting the usual cross-validation procedure. From this point on, it is possible to play with different models and spot differences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.shuffle(train)\n",
    "\n",
    "i_train = train[:36000]\n",
    "i_test = f[36000:]\n",
    "\n",
    "x_train = x_data[i_train, :]\n",
    "y_train = y_data[i_train]\n",
    "\n",
    "x_test = np.array([np.array(x_data[i,:]) for i in f[36000:]])\n",
    "y_test = np.array([y_data[i] for i in f[36000:]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from keras.models import Sequential\n",
    "from keras.layers import Dense, Dropout, BatchNormalization\n",
    "from keras import optimizers\n",
    "\n",
    "from sklearn.metrics import roc_auc_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define the keras model\n",
    "model = Sequential()\n",
    "\n",
    "model.add(Dense(32, activation='relu'))\n",
    "model.add(Dropout(rate=0.4))\n",
    "model.add(BatchNormalization())\n",
    "\n",
    "model.add(Dense(64, activation='relu'))\n",
    "model.add(Dropout(rate=0.4))\n",
    "model.add(BatchNormalization())\n",
    "\n",
    "model.add(Dense(1, activation='sigmoid'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adam = optimizers.Adam(lr=0.001)\n",
    "model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(x_train, y_train, epochs=100, batch_size=128)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We validate here the classification model by adopting the $\\href{https://it.wikipedia.org/wiki/Receiver_operating_characteristic}{AUC-ROC}$ score as quality metric. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# evaluate the keras model\n",
    "pe, accuracy = model.evaluate(x_test, y_test)\n",
    "print('Accuracy: %.2f' % (accuracy*100))\n",
    "\n",
    "p_train = model.predict(x_train)\n",
    "p_test = model.predict(x_test)\n",
    "\n",
    "print(\" Train AUC-ROC : {}\".format(roc_auc_score(y_train, p_train)))\n",
    "\n",
    "print(\" Test AUC-ROC : {}\".format(roc_auc_score(y_test, p_test)))\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (py35)",
   "language": "python",
   "name": "py35"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}