{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Detecting Stock Market Crashes with Topological Data Analysis\n", "\n", "**_Summary:_** In this notebook we show how topological data analysis (TDA) can be used as a descriptive tool to study stock market crashes based on past price information.\n", "\n", "**_Data:_** We'll be working with the daily time series of the S&P 500 index, which measures the stock performance of the top 500 US companies." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reload modules before executing user code\n", "%load_ext autoreload\n", "# Reload all modules every time before executing the Python code\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Data source\n", "import yfinance as yf\n", "\n", "# Data wrangling\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Data viz\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from plotting import plot_crash_detections, plot_crash_comparisons\n", "from pandas.plotting import register_matplotlib_converters\n", "register_matplotlib_converters()\n", "\n", "%matplotlib inline\n", "%config InlineBackend.figure_format = 'retina'\n", "sns.set(color_codes=True, rc={'figure.figsize':(12, 4)})\n", "sns.set_palette(sns.color_palette('muted'))\n", "\n", "# TDA magic\n", "import gtda.time_series as ts\n", "import gtda.diagrams as diag\n", "import gtda.homology as hl\n", "import gtda.graphs as gr\n", "from gtda.plotting import plot_diagram, plot_point_cloud\n", "from gtda.pipeline import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and explore S&P 500 time series\n", "\n", "Here we use the `yfinance` library to download historical market data from Yahoo! Finance. Let's grab the S&P 500 index and inspect the first / last few rows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SP500 = yf.Ticker(\"^GSPC\")\n", "sp500_df = SP500.history(period=\"max\")\n", "sp500_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sp500_df.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sp500_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are interested in the `Close` price values, so let's take a look at the complete time series:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "price_df = sp500_df['Close']\n", "price_df.plot()\n", "plt.title('Close Price')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, we can see the bursting of the dot-com bubble between 2000-2004, as well as the 2008 financial crisis." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resampling\n", "Since the markets are closed over weekends, let's resample to evenly-spaced daily values and focus on detecting crashes after 1980:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_year = '1980'\n", "# use pad to replace missing values by last non-missing value\n", "price_resampled_df = price_df.resample('24H').pad()[start_year:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(price_resampled_df)\n", "plt.title('Price')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Time series as point clouds -- Takens' embedding\n", "\n", "Discrete time series – that is, time series indexed by a sequence of times $t_0, t_1, \\ldots$ – are typically visualised as scatter plots in two dimensions. Points in the plot have times $t_i$ as their horizontal co-ordinates, and the corresponding values $y(t_i)$ of the variable of interest, $y$, as their vertical co-ordinates.\n", "\n", "This representation makes the **_local_** behaviour of the time series easy to track by scanning the plot from left to right. But it is often ineffective at conveying important effects which may be occurring over larger time scales.\n", "\n", "One well-known set of techniques for capturing **_periodic_** behaviour comes from Fourier analysis. For instance, the [discrete Fourier transform](https://en.wikipedia.org/wiki/Discrete_Fourier_Transform) of a temporal window over the time series gives information on whether the signal in that window arises as the sum of a few simple periodic signals.\n", "\n", "Here we want to present a different way of encoding a time-evolving process. It is based on the idea that some key properties of the dynamics (including, but more general than, periodicity or quasi-periodicity as seen by Fourier analysis) can be unveiled very effectively in higher dimensions. We will be able to represent a univariate time series (or a single temporal window over that time series) as a **_point cloud_**, i.e. a set of vectors in a Euclidean space of arbitrary dimension.\n", "\n", "The procedure works as follows: we pick two integers $d$ and $\\tau$. For each time $t_i \\in (t_0, t_1, \\ldots )$, we collect the values of the variable $y$ at $d$ distinct times, evenly spaced by $\\tau$ and starting at $t_i$, and present them as a vector with $d$ entries, namely $Y_{t_i} = (y_{t_i}, y_{t_i + \\tau}, \\ldots , y_{t_i + (d-1)\\tau})$. The result is a set of vectors in $d$-dimensional space! $\\tau$ is called the **_time delay_** parameter, and $d$ the **_embedding dimension_**.\n", "\n", "This **_time-delay embedding_** technique is also called **_Takens' embedding_** after Floris Takens, who demonstrated its significance with a celebrated [theorem](https://en.wikipedia.org/wiki/Takens%27s_theorem) in the context of *nonlinear* dynamical systems.\n", "\n", "The **_outer window_** allows us to apply Takens embedding locally on a certain interval rather than over the whole time series. The result of this procedure is therefore a time series of point clouds with possibly interesting topologies." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "embedding_dimension = 3\n", "embedding_time_delay = 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we need to initialise the embedder to represent the price time series as a time series of point clouds:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "embedder = ts.SingleTakensEmbedding(\n", " parameters_type=\"fixed\",\n", " dimension=embedding_dimension,\n", " time_delay=embedding_time_delay,\n", " n_jobs=-1,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that our embedder is initialised, it is a simple matter to obtain an array of embeddings, where each element is a 3-dimensional vector:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "price_values = price_resampled_df.values\n", "price_embedded = embedder.fit_transform(price_values)\n", "\n", "embedder_time_delay = embedder.time_delay_\n", "embedder_dimension = embedder.dimension_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we apply a sliding window to obtain the point clouds per window. We choose the window size such that each interval will span a period of 36 days:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "window_size = 31\n", "window_stride = 4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sliding_window = ts.SlidingWindow(size=window_size, stride=window_stride)\n", "price_embedded_windows = sliding_window.fit_transform(price_embedded)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our point clouds, let's visualise one of them:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "window_num = 42\n", "point_cloud = price_embedded_windows[window_num][:, :3]\n", "plot_point_cloud(point_cloud)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A simple baseline\n", "Here we create a simple baseline that tracks the first derivative of our time series over a sliding window. 
"By using the `SlidingWindow` class from `giotto-tda` we can quickly obtain arrays that are amenable to the `scikit-learn` API:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of raw price samples spanned by each window of embedded vectors\n", "window_size_price = window_size + (embedder_dimension - 1) * embedder_time_delay\n", "sliding_window_price = ts.SlidingWindow(size=window_size_price, stride=window_stride)\n", "window_indices = sliding_window_price.slice_windows(price_values)\n", "price_windows = sliding_window_price.fit_transform(price_values)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "abs_derivative_of_means = np.abs(np.mean(np.diff(price_windows, axis=0), axis=1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define time index to combine with numpy arrays\n", "indices = [win[1] - 1 for win in window_indices[1:]]\n", "time_index_derivs = price_resampled_df.iloc[indices].index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resampled_close_price_derivs = price_resampled_df.loc[time_index_derivs]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 5))\n", "plt.plot(time_index_derivs, abs_derivative_of_means, color='#1f77b4')\n", "plt.title('First Derivative of the S&P 500 Index')\n", "plt.savefig('./images/metric_first_derivative.png')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_crash_detections(\n", "    start_date=\"1981-01-01\",\n", "    end_date=\"2005-01-01\",\n", "    threshold=0.3,\n", "    distances=abs_derivative_of_means,\n", "    time_index_derivs=time_index_derivs,\n", "    price_resampled_derivs=resampled_close_price_derivs,\n", "    metric_name='First Derivative'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the plot we see that the first derivative appears to give an indication of where crashes have happened in the past, albeit with quite some noise around each event. Let's see if we can do better with TDA!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Persistence diagrams\n", "\n", "The first step in a TDA pipeline typically involves calculating persistence diagrams, which encode topological information on the dynamics in the embedding space. The horizontal axis corresponds to the moment at which a homological generator is born, while the vertical axis corresponds to the moment at which it dies." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the homology dimensions to track\n", "homology_dimensions = (0, 1)\n", "VR = hl.VietorisRipsPersistence(homology_dimensions=homology_dimensions, n_jobs=-1)\n", "diagrams = VR.fit_transform(price_embedded_windows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the plot below for a single window, the generators of each homology dimension $H_0$ and $H_1$ are coloured differently:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "VR.plot(diagrams, sample=window_num)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Homological derivatives\n", "\n", "Given a persistence diagram, there are a number of possible features that can be derived from it.\n",
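"For instance, one simple feature is the *lifetime* (death minus birth) of each generator. As a quick illustration for the window plotted above (this snippet is ours, not part of the original pipeline):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Each row of a gtda diagram is a (birth, death, homology dimension) triple\n", "births, deaths, dims = diagrams[window_num].T\n", "lifetimes = deaths - births\n", "# Longest-lived generator in each homology dimension\n", "{int(d): lifetimes[dims == d].max() for d in np.unique(dims)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [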
"For our application, we are interested in calculating the distance between diagrams obtained from two successive windows. Although one can calculate this directly with the `PairwiseDistance` class, we show below how one can make use of `giotto-tda`'s API to create a custom transformer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from joblib import Parallel, delayed, effective_n_jobs\n", "\n", "from sklearn.utils import gen_even_slices\n", "from sklearn.utils.validation import check_is_fitted\n", "\n", "from gtda.diagrams import PairwiseDistance\n", "from gtda.diagrams._metrics import _parallel_pairwise, landscapes, betti_curves\n", "from gtda.diagrams._utils import _subdiagrams\n", "from gtda.utils.validation import check_diagrams\n", "\n", "\n", "class HomologicalDerivative(PairwiseDistance):\n", "    \"\"\"Distance between successive pairs of persistence diagrams in a sequence.\"\"\"\n", "\n", "    def __init__(self, **kw_args):\n", "        super().__init__(**kw_args)\n", "\n", "    def _derivatives_generic(self, X, s):\n", "        return np.array(\n", "            [\n", "                _parallel_pairwise(\n", "                    np.expand_dims(X[t + 1], axis=0),\n", "                    np.expand_dims(X[t], axis=0),\n", "                    self.metric,\n", "                    self.effective_metric_params_,\n", "                    self.homology_dimensions_,\n", "                    self.n_jobs,\n", "                )[0, 0]\n", "                for t in range(s.start, s.stop, 1)\n", "            ]\n", "        )\n", "\n", "    def _derivatives(self, subdiagrams, kind, dim, params):\n", "        n_samples, n_points = subdiagrams.shape[:2]\n", "        if kind == \"landscape\":\n", "            n_layers = min(params[\"n_layers\"], n_points)\n", "            features = landscapes(subdiagrams, params[\"samplings\"][dim], n_layers)\n", "        elif kind == \"betti\":\n", "            features = betti_curves(subdiagrams, params[\"samplings\"][dim])\n", "        features = (features[1:] - features[:-1]).reshape(n_samples - 1, -1)\n", "        features = np.linalg.norm(features, axis=1, ord=params[\"p\"])\n", "\n", "        return (params[\"step_sizes\"][dim] ** (1 / params[\"p\"])) * features\n", "\n", "    def fit(self, X, y=None):\n", "        super().fit(X, y)\n", "\n", "        return self\n", "\n", "    def transform(self, X, y=None):\n", "        check_is_fitted(self, [\"effective_metric_params_\", \"homology_dimensions_\"])\n", "        X = check_diagrams(X)\n", "\n", "        if self.metric in [\"landscape\", \"betti\"]:\n", "            Xt = Parallel(n_jobs=self.n_jobs)(\n", "                delayed(self._derivatives)(\n", "                    _subdiagrams(X[s.start : s.stop + 1], [dim], remove_dim=True),\n", "                    self.metric,\n", "                    dim,\n", "                    self.effective_metric_params_,\n", "                )\n", "                for dim in self.homology_dimensions_\n", "                for s in gen_even_slices(len(X) - 1, effective_n_jobs(self.n_jobs))\n", "            )\n", "            Xt = np.concatenate(Xt)\n", "            Xt = Xt.reshape(len(self.homology_dimensions_), len(X) - 1).T\n", "        else:\n", "            Xt = Parallel(n_jobs=self.n_jobs)(\n", "                delayed(self._derivatives_generic)(X, s)\n", "                for s in gen_even_slices(len(X) - 1, effective_n_jobs(self.n_jobs))\n", "            )\n", "            Xt = np.concatenate(Xt)\n", "\n", "        if self.order is not None:\n", "            Xt = np.linalg.norm(Xt, axis=1, ord=self.order)\n", "\n", "        return Xt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Landscape distances" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With our transformer defined, let's use it to calculate the distance between successive diagrams using the landscape distance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_params = {\"p\": 2, \"n_layers\": 10, \"n_bins\": 1000}\n", "\n", "landscape_hom_der = HomologicalDerivative(\n", "    metric=\"landscape\", metric_params=metric_params, order=2, n_jobs=-1\n",
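"    # order=2: collapse the per-homology-dimension distances into a single L2 norm\n",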
")\n", "landscape_succ_dists = landscape_hom_der.fit_transform(diagrams)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(16, 4))\n", "plt.subplot(1, 2, 1)\n", "plt.plot(time_index_derivs, landscape_succ_dists, color=\"#1f77b4\")\n", "plt.title(\"Landscape Distances for S&P 500 Index\")\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.plot(resampled_close_price_derivs, \"#1f77b4\")\n", "plt.title(\"Price\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distances among diagrams using Betti curves\n", "\n", "In this section we show how to compute distances among persistece diagrams. There are many notions of distances: here we use the $l^p$ *norm of the Betti curves*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_params = {\"p\": 2, \"n_bins\": 1000}\n", "\n", "\n", "bettiHomDer = HomologicalDerivative(\n", " metric='betti', metric_params=metric_params, order=2, n_jobs=-1\n", ")\n", "betti_succ_dists = bettiHomDer.fit_transform(diagrams)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(16, 4))\n", "plt.subplot(1, 2, 1)\n", "plt.plot(time_index_derivs, betti_succ_dists)\n", "plt.title('Norm of L^p difference')\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.plot(resampled_close_price_derivs)\n", "plt.title('Price')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 5))\n", "plt.subplot(1, 2, 1)\n", "plt.plot(time_index_derivs, landscape_succ_dists, \"#1f77b4\")\n", "plt.title(\"Landscape Distances of the S&P 500 Index\")\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.plot(time_index_derivs, betti_succ_dists, \"#1f77b4\")\n", "plt.title('Betti Curve Distances of the S&P 500 Index')\n", "plt.savefig('./images/metric_landscape_betti.png')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Topological indicators for crashes\n", "By comparing the different ways to measure change of topological signature, it seems that the landscape approach carries more information and is more robust to noise, let's investigate a bit more what is happening around selected market crashes using the landscape distance.\n", "\n", "Let's investigate the last two major market crashes.\n", "\n", "* **dot-com crash:** from March 11, 2000, to October 9, 2004\n", "\n", "* **Subprime mortgage crisis:** from December 2007 – June 2009\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_crash_detections(\n", " start_date=\"1990-01-01\",\n", " end_date=\"2005-01-01\",\n", " threshold=0.3,\n", " distances=landscape_succ_dists,\n", " time_index_derivs=time_index_derivs,\n", " price_resampled_derivs=resampled_close_price_derivs,\n", " metric_name='Landscape Distance'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the plot on the right-hand side we see that our simple metric has been able to correctly identify the region in time when the dot-com bubble began to burst! 
"Let's now examine the 2008 financial crisis:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_crash_detections(\n", "    start_date=\"2005-01-01\",\n", "    end_date=\"2012-01-01\",\n", "    threshold=0.3,\n", "    distances=landscape_succ_dists,\n", "    time_index_derivs=time_index_derivs,\n", "    price_resampled_derivs=resampled_close_price_derivs,\n", "    metric_name='Landscape Distance'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again we see that the homological derivative has captured part of the region where the market was crashing. As a final comparison, let's see how the baseline model fares against our topological features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_crash_comparisons(\n", "    start_date=\"1981-01-01\",\n", "    end_date=\"2012-01-01\",\n", "    threshold=0.3,\n", "    distances_1=abs_derivative_of_means,\n", "    distances_2=landscape_succ_dists,\n", "    time_index_derivs=time_index_derivs,\n", "    price_resampled_derivs=resampled_close_price_derivs,\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "use_c_env_i", "language": "python", "name": "use_c_env_i" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }