{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gravitational wave detection\n", "Chrisopher Bresten and Jae-Hun Jung propose to include topological features in a CNN classifier for gravitational waves detection. Adapted from their article, this notebook showcases an application of ideas from the [Topology of time series example](https://giotto-ai.github.io/gtda-docs/latest/notebooks/topology_time_series.html).\n", "\n", "If you are looking at a static version of this notebook and would like to run its contents, head over to [GitHub](https://github.com/giotto-ai/giotto-tda/blob/master/examples/gravitational_waves_detection.ipynb) and download the source.\n", "\n", "## Useful references\n", "\n", "* [Topology of time series](https://giotto-ai.github.io/gtda-docs/latest/notebooks/topology_time_series.html), in which the *Takens embedding* technique used here is explained in detail and illustrated via simple examples.\n", "* [Detection of gravitational waves using topological data analysis and convolutional neural network: An improved approach](https://arxiv.org/abs/1910.08245) by Christopher Bresten and Jae-Hun Jung. We thank Christopher Bresten for sharing the code and data used in the article.\n", "\n", "## See also\n", "\n", "- [Topology in time series forecasting](https://giotto-ai.github.io/gtda-docs/latest/notebooks/time_series_forecasting.html), in which the Takens embedding technique is used in time series forecasting tasks by using sliding windows.\n", "- [Topological feature extraction using VietorisRipsPersistence and PersistenceEntropy](https://giotto-ai.github.io/gtda-docs/latest/notebooks/vietoris_rips_quickstart.html) for a quick introduction to general topological feature extraction in ``giotto-tda``.\n", "\n", "**License: AGPLv3**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motivation\n", "The videos below show different representations of the gravitational waves that we aim to detect. We will aim to pick out the \"chirp\" signal of two colliding black holes from a very noisy backgound." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import YouTubeVideo\n", "\n", "YouTubeVideo(\"Y3eR49ogsF0\", width=600, height=400)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "YouTubeVideo(\"QyDcTbR-kEA\", width=600, height=400)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the article, the authors create a synthetic training set as follows: \n", "\n", "* Generate gravitational wave signals that correspond to non-spinning binary black hole mergers\n", "* Generate a noisy time series and embed a gravitational wave signal with probability 0.5 at a random time.\n", "\n", "The result is a set of time series of the form\n", "\n", "$$ s = g + \\epsilon \\frac{1}{R}\\xi $$\n", "\n", "where $g$ is a gravitational wave signal from the reference set, $\\xi$ is Gaussian noise, $\\epsilon=10^{-19}$ scales the noise amplitude to the signal, and $R \\in (0.075, 0.65)$ is a parameter that controls the signal-to-noise-ratio (SNR)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Constant signal-to-noise ratio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a warmup, let's generate some noisy signals with a constant SNR of 17.98. As shown in Table 1 of the article, this corresponds to an $R$ value of 0.65. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "By picking the upper end of the interval, we place ourselves in a favorable scenario and can thus gain a sense of the best possible performance of our time series classifier. We pick a small number of samples to make the computations run fast, but in practice one would scale this up by 1-2 orders of magnitude, as done in the original article." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from data.generate_datasets import make_gravitational_waves\n", "from pathlib import Path\n", "\n", "R = 0.65\n", "n_signals = 100\n", "DATA = Path(\"./data\")\n", "\n", "noisy_signals, gw_signals, labels = make_gravitational_waves(\n", "    path_to_data=DATA, n_signals=n_signals, r_min=R, r_max=R, n_snr_values=1\n", ")\n", "\n", "print(f\"Number of noisy signals: {len(noisy_signals)}\")\n", "print(f\"Number of timesteps per series: {len(noisy_signals[0])}\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's visualise the two different types of time series that we wish to classify: one that is pure noise vs. one that is composed of noise plus an embedded gravitational wave signal:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from plotly.subplots import make_subplots\n", "import plotly.graph_objects as go\n", "\n", "# get the index corresponding to the first pure noise time series\n", "background_idx = np.argmin(labels)\n", "# get the index corresponding to the first noise + gravitational wave time series\n", "signal_idx = np.argmax(labels)\n", "\n", "ts_noise = noisy_signals[background_idx]\n", "ts_noise_plus_signal = noisy_signals[signal_idx]\n", "ts_signal = gw_signals[signal_idx]\n", "\n", "fig = make_subplots(rows=1, cols=2)\n", "\n", "fig.add_trace(\n", "    go.Scatter(x=list(range(len(ts_noise))), y=ts_noise, mode=\"lines\", name=\"noise\"),\n", "    row=1,\n", "    col=1,\n", ")\n", "\n", "fig.add_trace(\n", "    go.Scatter(\n", "        x=list(range(len(ts_noise_plus_signal))),\n", "        y=ts_noise_plus_signal,\n", "        mode=\"lines\",\n", "        name=\"noise + signal\",\n", "    ),\n", "    row=1,\n", "    col=2,\n", ")\n", "\n", "fig.add_trace(\n", "    go.Scatter(x=list(range(len(ts_signal))), y=ts_signal, mode=\"lines\", name=\"signal\"),\n", "    row=1,\n", "    col=2,\n", ")\n", "fig.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We make two observations:\n", "1. It is hard to distinguish the signal by eye;\n", "2. The signal features some regularity or periodicity.\n", "\n", "Both observations lead us to examine the _**Takens embedding**_ of the signal $s(t)$, in order to pick up the recurrent structure. Indeed, if $s$ is sampled from a dynamical system with a non-trivial recurrent structure, then, for appropriate parameters, the image of the embedding will have non-trivial topology.\n", "\n", "More formally, we extract a sequence of vectors in $\mathbb{R}^{d}$ via the map\n", "\n", "$$\n", "TD_{d,\tau} s : \mathbb{R} \to \mathbb{R}^{d}\,, \qquad t \mapsto \begin{bmatrix}\n", "   s(t) \\\n", "   s(t + \tau) \\\n", "   s(t + 2\tau) \\\n", "   \vdots \\\n", "   s(t + (d-1)\tau)\n", "   \end{bmatrix},\n", "$$\n", "where $d$ is the embedding dimension and $\tau$ is the time delay." ] },
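{ "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal illustration of this construction (plain NumPy on a toy series, not ``giotto-tda``'s implementation), each row below is one delay vector $[s(t), s(t + \tau), \ldots, s(t + (d-1)\tau)]$:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def toy_takens_embedding(s, dimension, time_delay, stride=1):\n", "    # Number of delay vectors that fit inside the series\n", "    n_points = (len(s) - (dimension - 1) * time_delay - 1) // stride + 1\n", "    window = (dimension - 1) * time_delay + 1\n", "    return np.stack(\n", "        [s[i * stride : i * stride + window : time_delay] for i in range(n_points)]\n", "    )\n", "\n", "print(toy_takens_embedding(np.arange(10), dimension=3, time_delay=2, stride=2))\n", "# [[0 2 4]\n", "#  [2 4 6]\n", "#  [4 6 8]]" ] },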
{ "cell_type": "markdown", "metadata": {}, "source": [ "The quantity $(d-1)\tau$ is known as the \"window size\", and the difference between the starting times $t_{i+1}$ and $t_i$ of consecutive windows is called the stride.\n", "\n", "Let's examine what the time delay embedding of a pure gravitational wave signal looks like:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gtda.time_series import SingleTakensEmbedding\n", "\n", "embedding_dimension = 30\n", "embedding_time_delay = 30\n", "stride = 5\n", "\n", "# With parameters_type=\"search\", time_delay and dimension act as upper bounds\n", "# for an automated search of optimal embedding parameters\n", "embedder = SingleTakensEmbedding(\n", "    parameters_type=\"search\",\n", "    n_jobs=6,\n", "    time_delay=embedding_time_delay,\n", "    dimension=embedding_dimension,\n", "    stride=stride,\n", ")\n", "\n", "y_gw_embedded = embedder.fit_transform(gw_signals[0])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can use PCA to project the high-dimensional embedding down to 3 dimensions for visualisation:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from gtda.plotting import plot_point_cloud\n", "\n", "pca = PCA(n_components=3)\n", "y_gw_embedded_pca = pca.fit_transform(y_gw_embedded)\n", "\n", "plot_point_cloud(y_gw_embedded_pca)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "From the plot we can see that the decaying periodic signal generated by a black hole merger emerges as a _spiral_ in the time delay embedding space! For contrast, let's compare this to one of the pure noise time series in our sample:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "embedding_dimension = 30\n", "embedding_time_delay = 30\n", "stride = 5\n", "\n", "embedder = SingleTakensEmbedding(\n", "    parameters_type=\"search\",\n", "    n_jobs=6,\n", "    time_delay=embedding_time_delay,\n", "    dimension=embedding_dimension,\n", "    stride=stride,\n", ")\n", "\n", "y_noise_embedded = embedder.fit_transform(noisy_signals[background_idx])\n", "\n", "pca = PCA(n_components=3)\n", "y_noise_embedded_pca = pca.fit_transform(y_noise_embedded)\n", "\n", "plot_point_cloud(y_noise_embedded_pca)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Evidently, pure noise resembles a high-dimensional ball in the time delay embedding space. Let's see if we can use persistent homology to tease apart which time series contain a gravitational wave signal and which do not. To do so we will adapt the strategy from the original article:\n", "\n", "1. Generate 200-dimensional time delay embeddings of each time series\n", "2. Use PCA to reduce the time delay embeddings to 3 dimensions\n", "3. Use the Vietoris-Rips construction to calculate persistence diagrams of $H_0$ and $H_1$ generators\n", "4. Extract feature vectors using persistence entropy (a short sketch of this quantity follows the list)\n", "5. Train a binary classifier on the topological features" ] },
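{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is the sketch of persistence entropy promised in step 4: a diagram is summarized by the Shannon entropy of its normalized lifetimes. This is plain NumPy for intuition only; the pipeline below uses ``giotto-tda``'s ``PersistenceEntropy``, whose ``normalize`` option applies an additional normalization documented in the library." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def persistence_entropy(diagram):\n", "    # diagram: array of (birth, death) pairs; lifetimes l_i = death_i - birth_i\n", "    lifetimes = diagram[:, 1] - diagram[:, 0]\n", "    p = lifetimes / lifetimes.sum()\n", "    return -np.sum(p * np.log2(p))  # E = -sum_i p_i * log2(p_i)\n", "\n", "toy_diagram = np.array([[0.0, 1.0], [0.0, 0.5], [0.2, 0.4]])\n", "print(persistence_entropy(toy_diagram))" ] },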
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Define the topological feature generation pipeline" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can do steps 1 and 2 by using the following ``giotto-tda`` tools:\n", "\n", "- The ``TakensEmbedding`` transformer – instead of ``SingleTakensEmbedding`` – which will transform each time series in ``noisy_signals`` separately and return a collection of point clouds;\n", "- ``CollectionTransformer``, which is a convenience \"meta-estimator\" for applying the same PCA to each point cloud resulting from step 1.\n", "\n", "Using the ``Pipeline`` class from ``giotto-tda``, we can chain all operations up to and including step 4 as follows:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gtda.diagrams import PersistenceEntropy, Scaler\n", "from gtda.homology import VietorisRipsPersistence\n", "from gtda.metaestimators import CollectionTransformer\n", "from gtda.pipeline import Pipeline\n", "from gtda.time_series import TakensEmbedding\n", "\n", "embedding_dimension = 200\n", "embedding_time_delay = 10\n", "stride = 10\n", "\n", "embedder = TakensEmbedding(time_delay=embedding_time_delay,\n", "                           dimension=embedding_dimension,\n", "                           stride=stride)\n", "\n", "batch_pca = CollectionTransformer(PCA(n_components=3), n_jobs=-1)\n", "\n", "persistence = VietorisRipsPersistence(homology_dimensions=[0, 1], n_jobs=-1)\n", "\n", "scaling = Scaler()\n", "\n", "entropy = PersistenceEntropy(normalize=True, nan_fill_value=-10)\n", "\n", "steps = [(\"embedder\", embedder),\n", "         (\"pca\", batch_pca),\n", "         (\"persistence\", persistence),\n", "         (\"scaling\", scaling),\n", "         (\"entropy\", entropy)]\n", "topological_transformer = Pipeline(steps)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = topological_transformer.fit_transform(noisy_signals)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Train and evaluate a model" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For the final step, let's train a simple classifier on our topological features." ] },
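{ "cell_type": "markdown", "metadata": {}, "source": [ "Before fitting, note that with ``homology_dimensions=[0, 1]`` each time series is summarized by one persistence entropy per homology dimension, so the whole dataset can be viewed as a 2D scatter plot. This optional sanity check (reusing ``features`` and ``labels`` from above) hints at how separable the two classes are:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Scatter the two persistence entropies, colored by class (0 = noise, 1 = noise + signal)\n", "fig = go.Figure(\n", "    go.Scatter(\n", "        x=features[:, 0],\n", "        y=features[:, 1],\n", "        mode=\"markers\",\n", "        marker=dict(color=labels, colorscale=\"Viridis\", showscale=True),\n", "    )\n", ")\n", "fig.update_layout(\n", "    xaxis_title=\"Persistence entropy (H0)\", yaxis_title=\"Persistence entropy (H1)\"\n", ")\n", "fig.show()" ] },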
{ "cell_type": "markdown", "metadata": {}, "source": [ "As usual, we create training and validation sets:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_valid, y_train, y_valid = train_test_split(\n", "    features, labels, test_size=0.1, random_state=42\n", ")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "and then fit and evaluate our model:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score, roc_auc_score\n", "\n", "\n", "def print_scores(fitted_model):\n", "    # scikit-learn metrics expect (y_true, y_pred)\n", "    res = {\n", "        \"Accuracy on train:\": accuracy_score(y_train, fitted_model.predict(X_train)),\n", "        \"ROC AUC on train:\": roc_auc_score(\n", "            y_train, fitted_model.predict_proba(X_train)[:, 1]\n", "        ),\n", "        \"Accuracy on valid:\": accuracy_score(y_valid, fitted_model.predict(X_valid)),\n", "        \"ROC AUC on valid:\": roc_auc_score(\n", "            y_valid, fitted_model.predict_proba(X_valid)[:, 1]\n", "        ),\n", "    }\n", "\n", "    for k, v in res.items():\n", "        print(k, round(v, 3))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "model = LogisticRegression()\n", "model.fit(X_train, y_train)\n", "print_scores(model)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As a simple baseline, this model is not too bad: it outperforms the deep learning baseline in the article, which typically fares little better than random on the raw data. However, the combination of deep learning and persistent homology is where significant performance gains are seen; we leave exploring that combination as an exercise to the intrepid reader!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }