{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting Molecular Bond Strengths\n", "\n", "*Created by:* Colin Kälin and Lewis Tunstall, December 2019 \n", "\n", "*Blog post:* https://bit.ly/2Qj82FF\n", "\n", "*Summary:* This notebook provides a walkthrough on creating topological features for machine learning with [giotto-learn](https://github.com/giotto-ai/giotto-learn). \n", "\n", "## Data\n", "\n", "The dataset comes from the [Predicting Molecular Properties](https://www.kaggle.com/c/champs-scalar-coupling/overview) competition on Kaggle. The goal is to predict the coupling strength between atoms in molecules. To download the data files run the following cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# uncomment and run to download data\n", "# !wget https://storage.googleapis.com/l2f-open-models/molecule-bond-prediction/data.zip; unzip -o data.zip; rm data.zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Related Kaggle kernels\n", "\n", "* Inspiration for the creation of the non-topological features was taken from [here](https://www.kaggle.com/robertburbidge/distance-features). \n", "* For the molecule visualization, [this](https://www.kaggle.com/mykolazotko/3d-visualization-of-molecules-with-plotly) notebook served as a template.\n", "\n", "## Workflow\n", "Here we will briefly give an overview of the pipeline that is used in this notebook:\n", "\n", "1. Choose a point cloud or a graph\n", "2. Apply Vietoris-Rips persistence; once for the point clouds and once for the graphs.\n", "3. Create persistence diagrams and Betti curves from point clouds / graphs.\n", "4. Extract features from the diagrams\n", "5. Train a model and make predictions (in this case with XGBoost)\n", "\n", "## Results\n", "At the end of this walkthrough, you should obtain the results shown in the figure below, which highlight that for this example, the combination of topological and non-topological feature outperform a model based purely on non-topological features.\n", "![title](data/figures/results.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# General imports\n", "from itertools import product\n", "import os, random, sys\n", "\n", "sys.path.append(\"src/\")\n", "\n", "# Other\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "warnings.simplefilter(action=\"ignore\", category=FutureWarning)\n", "\n", "# Import for data handling\n", "import numpy as np\n", "import pandas as pd\n", "import networkx as nx\n", "import pickle\n", "from data.dataset_utils import get_selected_structures\n", "\n", "# Machine Learning imports\n", "from gtda.homology import VietorisRipsPersistence\n", "from models.model import cv_model\n", "import gtda.diagrams as diag\n", "\n", "# Feature creation\n", "from data.dataset_utils import create_non_TDA_pickle\n", "from features.features import *\n", "\n", "# Plotting imports\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from visualization.molecule_plotting import plot_molecule\n", "from visualization.plotting import plot_diagram, plot_betti_curves, plot_molecule_types\n", "import gtda.diagrams as diag\n", "from visualization.plotting_results import *\n", "from plotly.offline 
"import plotly.graph_objs as gobj" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Load and explore data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For this demo we have prepared a dataset that contains the 100 largest molecules from the training set, along with a variety of _non-topological_ features inspired by [this](https://www.kaggle.com/robertburbidge/distance-features) Kaggle kernel." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"data/processed/largest_100_molecules.pkl\", \"rb\") as f:\n", "    X, y, molecule_selection, structures, molecule_list = pickle.load(f)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The _**features**_ are contained in the `X` dataframe:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X.head()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X.shape" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "where each row corresponds to an _atom-pair_, and there are 12,609 such pairs in our dataset. The _**target variable**_ (the scalar coupling constant) is a continuous quantity and is contained in `y`:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Metadata about the molecules is contained in the `structures` dataframe:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "structures.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "while `molecule_selection` and `molecule_list` contain IDs for each molecule.\n", "\n", "For these 100 molecules we have five different bond types: 1JHC, 2JHC, 3JHC, 2JHH and 3JHH. The plot below shows how the bond strengths vary across these types." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_molecule_types(molecule_selection)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Feature creation and visualization\n", "To show how features generated via topological data analysis (TDA) can be used to improve machine learning models, we will combine topological features with conventional ones.\n", "\n", "## Non-topological\n", "For the conventional features we have taken those from [this](https://www.kaggle.com/robertburbidge/distance-features) Kaggle kernel, which include geometric quantities such as the distance of a given atom to the mean $(x,y,z)$ coordinates -- see the kernel for a detailed explanation of what each feature is. In case you want to recreate the data on your own, you can use the function `create_non_TDA_features()` in the `features.py` script in this repository.\n", "\n", "For the purposes of this notebook we can treat these features as given, since the goal is to focus on the creation of _topological features_.\n", "\n", "## Topological\n", "\n", "In general, there are two different ways to generate topological features from data: treat the molecule either as a point cloud or as a graph.\n", "\n", "### Point cloud\n", "\n", "To use TDA we start with a point cloud. In our case this is simply the molecule itself, with the $(x,y,z)$ coordinates of the atoms given relative to their mean."
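, "\n", "\n", "The `get_selected_structures()` helper used in the next cell prepares a dataframe with these centred coordinates (`x_new`, `y_new`, `z_new`). As a rough illustration of the idea -- not the actual implementation -- the centring could be done with pandas along the following lines, assuming a dataframe like `structures` with `molecule_name` and `x`, `y`, `z` columns:\n", "\n", "```python\n", "# Hypothetical sketch (not the repository code): centre each molecule's coordinates\n", "# on its mean position to obtain x_new, y_new, z_new.\n", "centred = structures.copy()\n", "means = centred.groupby(\"molecule_name\")[[\"x\", \"y\", \"z\"]].transform(\"mean\")\n", "centred[[\"x_new\", \"y_new\", \"z_new\"]] = centred[[\"x\", \"y\", \"z\"]].values - means.values\n", "```"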
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "selected_structures = get_selected_structures(molecule_selection); selected_structures.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can take this point cloud to find some interesting structures. In order to do this, we want to create a persistence diagram using the [Vietoris-Rips](https://en.wikipedia.org/wiki/Vietoris%E2%80%93Rips_complex) filtration algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "persistence_diagrams = []\n", "\n", "for molecule in molecule_selection:\n", " homology_dimensions = [0, 1, 2]\n", " persistence = VietorisRipsPersistence(\n", " metric=\"euclidean\", homology_dimensions=homology_dimensions, n_jobs=1\n", " )\n", "\n", " point_cloud = selected_structures[selected_structures[\"molecule_name\"] == molecule][\n", " [\"x_new\", \"y_new\", \"z_new\"]\n", " ].values\n", " point_cloud = point_cloud.reshape((1, point_cloud.shape[0], point_cloud.shape[1]))\n", " persistence_diagrams.append(persistence.fit_transform(point_cloud))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to plot the persistence diagram for a given molecule. In order to learn more about how to interpret this diagram, see this [review](https://arxiv.org/pdf/1710.04019.pdf) article.\n", "\n", "You can try out one of the 100 molecules by choosing the `molecule_idx` variable (between 0 and 99):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "molecule_idx = 72\n", "persistence_fig = plot_diagram(persistence_diagrams[molecule_idx][0])\n", "iplot(persistence_fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From these diagrams, a variety of features can be extracted to feed our regression model downstream. Since each persistence diagram is a NumPy array of birth-death-dimension triples, we can code up simple functions to manipulate the data. For instance, we can calculate the _average lifetime_ of points as follows:\n", "\n", "```python\n", "def average_lifetime(X_scaled, homology_dim):\n", " \"\"\"\n", " Parameters\n", " ----------\n", " X_scaled: scaled persistence diagrams, numpy array\n", " homology_dim: dimension of the homology to consider, integer\n", "\n", " Returns\n", " -------\n", " avg_lifetime_list: list of average lifetime for each time window\n", " \"\"\"\n", "\n", " avg_lifetime_list = []\n", "\n", " for i in range(X_scaled.shape[0]):\n", " persistence_table = pd.DataFrame(\n", " X_scaled[i], columns=[\"birth\", \"death\", \"homology\"]\n", " )\n", " persistence_table[\"lifetime\"] = (\n", " persistence_table[\"death\"] - persistence_table[\"birth\"]\n", " )\n", " avg_lifetime_list.append(\n", " persistence_table[persistence_table[\"homology\"] == homology_dim][\n", " \"lifetime\"\n", " ].mean()\n", " )\n", "\n", " return avg_lifetime_list\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Feature-generating functions like these can be found in the `features.py` module of this repo. 
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from features.features import average_lifetime\n", "\n", "for homology_dim in homology_dimensions:\n", "    print(\n", "        f\"Homology dimension = {homology_dim} | Average lifetime = {average_lifetime(persistence_diagrams[molecule_idx], homology_dim)}\"\n", "    )" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In other words, for each molecule we can assign three new features: the \"average lifetime\" in each homology dimension. Another interesting feature is the number of \"relevant\" holes, which counts the cyan coloured points (H1) that lie further than some predefined threshold from the birth = death diagonal (the dashed line in the diagram):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "molecule_idx = 72\n", "molecule_fig = plot_molecule(molecule_selection[molecule_idx], structures)\n", "iplot(molecule_fig)\n", "\n", "# Depending on the threshold 'theta' the function is more or less susceptible to noise\n", "print('Number of relevant holes:', num_relevant_holes(persistence_diagrams[molecule_idx], 1, theta=0.55)[0])" ] }
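, { "cell_type": "markdown", "metadata": {}, "source": [ "The actual implementation of `num_relevant_holes()` lives in the `features.py` module. As a rough sketch of the idea -- not necessarily the repository's code -- one could count the points of a given homology dimension whose lifetime exceeds a fraction `theta` of the maximal lifetime in that dimension:\n", "\n", "```python\n", "def num_relevant_holes_sketch(X_scaled, homology_dim, theta=0.5):\n", "    \"\"\"Hypothetical sketch: count points of a given homology dimension whose\n", "    lifetime exceeds theta times the maximal lifetime in that dimension.\"\"\"\n", "    counts = []\n", "    for i in range(X_scaled.shape[0]):\n", "        diagram = X_scaled[i]\n", "        # column order is (birth, death, homology dimension), as above\n", "        mask = diagram[:, 2] == homology_dim\n", "        lifetimes = diagram[mask, 1] - diagram[mask, 0]\n", "        # assumes at least one point of this dimension is present\n", "        counts.append(int(np.sum(lifetimes > theta * lifetimes.max())))\n", "    return counts\n", "```" ] }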
, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly to the persistence diagram, a plot of the [Betti curves](https://en.wikipedia.org/wiki/Betti_number) can be used to extract features. A Betti curve tracks how the number of holes changes as the radius $\varepsilon$ of the filtration increases:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "molecule_idx = 72\n", "betti_curves = diag.BettiCurve()\n", "betti_curves.fit(persistence_diagrams[molecule_idx])\n", "X_betti_curves = betti_curves.transform(persistence_diagrams[molecule_idx])\n", "betti_fig = plot_betti_curves(X_betti_curves[0])\n", "iplot(betti_fig)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In a similar manner to persistence diagrams, we can also derive features from the Betti curves. One natural feature is the _area under the curve_, which implemented as code might look something like:\n", "\n", "```python\n", "def area_under_Betti_curve(X_betti_curves, homology_dim):\n", "    \"\"\"Compute the area under the Betti curve for a given homology dimension\n", "\n", "    Parameters\n", "    ----------\n", "    X_betti_curves : ndarray, shape (n_samples, n_homology_dimensions, n_values)\n", "\n", "    homology_dim : int\n", "        Homology dimension to consider, must be contained in the persistence diagram\n", "\n", "    Returns\n", "    -------\n", "    area : list, shape (n_samples)\n", "        List of areas under the Betti curve for a given homology dimension.\n", "\n", "    \"\"\"\n", "    area = []\n", "    for n in range(X_betti_curves.shape[0]):\n", "        area.append(np.trapz(X_betti_curves[n, homology_dim], dx=1))\n", "    return area\n", "```\n", "\n", "To see it in action for a single molecule, we can import it as follows:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from features.features import area_under_Betti_curve" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for homology_dim in homology_dimensions:\n", "    print(\n", "        f\"Homology dimension = {homology_dim} | Area under Betti curve = {area_under_Betti_curve(X_betti_curves, homology_dim)}\"\n", "    )" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Altogether, we created a wide variety of topological features, including:\n", "\n", "* Number of holes (in dimensions 0, 1 and 2): the number of points in the persistence diagram, per dimension\n", "* Number of relevant holes (in dimensions 0, 1 and 2): the number of points in the persistence diagram whose lifetime is larger than 50% of the maximal lifetime, per dimension\n", "* Average lifetime (in dimensions 0, 1 and 2): the average lifetime of the points, per dimension\n", "* Wasserstein amplitude: one can calculate an amplitude for each persistence diagram. For an explanation see the [docs](https://giotto.ai/theory).\n", "\n", "We provide a pre-calculated set of these features, which can be loaded as follows:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('data/processed/tda_features_cloud.pkl', 'rb') as f:\n", "    features_dict_cloud = pickle.load(f)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As the features are molecule-specific (i.e. we have calculated one value per molecule rather than one per atom pair), we have to assign the correct value to each sample in the dataset. The feature dictionary contains the following keys:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list(features_dict_cloud.keys())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next we attach these topological features to `X`:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "attach_tda_features(X, features_dict_cloud, molecule_list)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Graph" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can also take another approach and treat the molecule as a graph instead of as a point cloud. To do this, one has to define a distance matrix that encodes the distance between any two atoms in the molecule."
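, "\n", "\n", "As a rough, stand-alone sketch of this idea (the repository's helper functions described below handle the actual computation), one could build a Euclidean distance matrix for a single molecule and pass it to `VietorisRipsPersistence` with `metric=\"precomputed\"`:\n", "\n", "```python\n", "# Hypothetical sketch: persistence from a precomputed distance matrix.\n", "# Plain Euclidean distances are used here purely for illustration.\n", "from scipy.spatial.distance import pdist, squareform\n", "\n", "coords = selected_structures[selected_structures[\"molecule_name\"] == molecule_selection[0]][\n", "    [\"x_new\", \"y_new\", \"z_new\"]\n", "].values\n", "distance_matrix = squareform(pdist(coords))  # (n_atoms, n_atoms) pairwise distances\n", "distance_matrix = distance_matrix[None, :, :]  # add the sample axis expected by gtda\n", "\n", "persistence = VietorisRipsPersistence(metric=\"precomputed\", homology_dimensions=[0, 1, 2])\n", "diagram = persistence.fit_transform(distance_matrix)\n", "```"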
, "\n", "\n", "Then we can take the same steps as before: calculate the persistence diagrams and extract features.\n", "\n", "---\n", "**You should know:** For giotto-learn's `VietorisRipsPersistence()` class to interpret the matrix we pass as a distance matrix rather than as a point cloud, it is important to set the `metric` parameter to `precomputed`. This can be seen in the `computing_persistence_diagram()` function.\n", "\n", "Calculating the persistence diagrams and the features takes a while. For this reason we load the data from a pickle file. In case you want to recompute the diagrams and the features, set the `recalculate_graph_pd` variable to `True`.\n", "\n", "```python\n", "pers_diag_list_graph = get_graph_persistence_diagrams(\n", "    molecule_selection, structures, recalculate_graph_pd=False\n", ")\n", "```\n", "\n", "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's load the pre-calculated features as a dictionary:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('data/processed/tda_features_graph.pkl', 'rb') as f:\n", "    features_dict_graph = pickle.load(f)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "And again we add the newly created features to the dataset:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "attach_tda_features(X, features_dict_graph, molecule_list, suffix='_graph')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Model training and evaluation" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "At this point we have created some features and would like to test them. We train an `XGBRegressor` model and use 5-fold cross-validation. Let's start with the non-topological features only and later compare the results when we add the topological ones.\n", "\n", "The score is calculated as follows (see also [here](https://www.kaggle.com/c/champs-scalar-coupling/overview/evaluation)):\n", "\n", "$$\text{score} = \frac{1}{T}\sum_{t=1}^{T}\log \left( \frac{1}{n_t}\sum_{i=1}^{n_t}|y_i-\hat{y}_i| \right)$$\n", "\n", "where:\n", "\n", "* $T$ is the number of coupling types\n", "* $n_t$ is the number of observations of type $t$\n", "* $y_i$ is the actual coupling value for this sample\n", "* $\hat{y}_i$ is the predicted coupling value for this sample" ] }
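, { "cell_type": "markdown", "metadata": {}, "source": [ "The cross-validation and scoring are handled by the `cv_model()` helper from `models/model.py`. As a hedged, stand-alone sketch of the metric itself (assuming arrays `y_true`, `y_pred` and an array `types` holding the coupling type of each sample -- names chosen here for illustration), it could be computed as:\n", "\n", "```python\n", "# Hypothetical sketch of the competition metric: mean over coupling types of log(MAE).\n", "def group_mean_log_mae(y_true, y_pred, types):\n", "    abs_err = pd.Series(np.abs(np.asarray(y_true) - np.asarray(y_pred)))\n", "    # mean absolute error within each coupling type\n", "    mae_per_type = abs_err.groupby(np.asarray(types)).mean()\n", "    return np.log(mae_per_type).mean()\n", "```" ] }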
, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "non_tda_columns = ['atom_index_0', 'atom_index_1', 'type', 'type_0', 'type_1', 'atom_0',\n", "    'x_0', 'y_0', 'z_0', 'atom_1', 'x_1', 'y_1', 'z_1', 'dist', 'dist_x',\n", "    'dist_y', 'dist_z', 'dist_to_type_mean', 'dist_to_type_0_mean',\n", "    'dist_to_type_1_mean', 'molecule_dist_mean_x', 'molecule_dist_std_x',\n", "    'molecule_dist_skew_x', 'molecule_dist_kurt_x', 'molecule_dist_mean_y',\n", "    'molecule_dist_std_y', 'molecule_dist_skew_y', 'molecule_dist_kurt_y',\n", "    'meanx', 'meany', 'meanz', 'dist_0tomean', 'dist_1tomean', 'meanxH',\n", "    'meanyH', 'meanzH', 'dist_0tomeanH', 'dist_1tomeanH', 'meanxC',\n", "    'meanyC', 'meanzC', 'dist_0tomeanC', 'dist_1tomeanC', 'meanxN',\n", "    'meanyN', 'meanzN', 'dist_0tomeanN', 'dist_1tomeanN', 'meanxO',\n", "    'meanyO', 'meanzO', 'dist_0tomeanO', 'dist_1tomeanO', 'meanxF',\n", "    'meanyF', 'meanzF', 'dist_0tomeanF', 'dist_1tomeanF', 'atom_count',\n", "    'atom_0l', 'x_0l', 'y_0l', 'z_0l', 'atom_0r', 'x_0r', 'y_0r', 'z_0r',\n", "    'dist_0l', 'dist_0r', 'atom_1l', 'x_1l', 'y_1l', 'z_1l', 'atom_1r',\n", "    'x_1r', 'y_1r', 'z_1r', 'dist_1l', 'dist_1r']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results_mean_1, results_details_1 = cv_model(X, y, non_tda_columns)\n", "results_mean_1.append(np.mean(results_mean_1))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "What changes if we include the TDA features?" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results_mean_2, results_details_2 = cv_model(X, y, X.columns)\n", "results_mean_2.append(np.mean(results_mean_2))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot the scores and see what improvement we get with TDA. Don't forget that, due to the logarithm in the formula above, a lower score is actually better (i.e. -0.5 is better than -0.3, for example)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "summary_df = create_summary_df(np.array([results_mean_1, results_mean_2]).T)\n", "fig = plot_results(summary_df)\n", "iplot(fig)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# percentage change of the score per bond type (last entry is the overall mean, so drop it)\n", "baseline = np.array(results_mean_1)[:-1]\n", "with_tda = np.array(results_mean_2)[:-1]\n", "pct_change = (with_tda - baseline) / baseline * 100\n", "\n", "print('Mean improvement:', np.round(np.mean(pct_change), 2), '%')\n", "print('Max. improvement:', np.round(np.max(pct_change), 2), '%')\n", "print('Min. improvement:', np.round(np.min(pct_change), 2), '%')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "It turns out that the topological features improve the score for every bond type, although the improvement is larger for some types than for others." ] }
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }