{ "cells": [ { "cell_type": "markdown", "source": [ "# Preprocessing time series with aeon\n", "\n", "It is common to need to preprocess time series data before applying machine learning\n", "algorithms. Series in a collection may differ in scale, have unequal lengths or contain\n", "missing values. Some algorithms can handle these characteristics internally; otherwise,\n", "`aeon` transformers can be used to preprocess collections of time series into a\n", "standard format. This notebook demonstrates three common use cases:\n", "\n", "1. [Rescaling time series](#Rescaling-time-series)\n", "2. [Resizing time series](#Resizing-time-series)\n", "3. [Dealing with missing values](#Missing-Values)\n" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "## Rescaling time series\n", "\n", "Different levels of scale and variance can mask discriminative patterns in time\n", "series. This is particularly true for methods that are based on distances. It is common\n", "to rescale time series to have zero mean and unit variance. For example, the data in\n", "the `UnitTest` dataset is a subset of the\n", "[Chinatown dataset](https://timeseriesclassification.com/description.php?Dataset=Chinatown).\n", "These are counts of pedestrians in Chinatown, Melbourne. 
The time series have different means and standard deviations:" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "source": [ "import numpy as np\n", "\n", "from aeon.datasets import load_unit_test\n", "\n", "X, y = load_unit_test(split=\"Train\")\n", "np.mean(X, axis=-1)[0:5]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-11-17T13:48:57.655001Z", "start_time": "2024-11-17T13:48:57.631756Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[561.875 ],\n", " [604.95833333],\n", " [629.16666667],\n", " [801.45833333],\n", " [540.75 ]])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 3 }, { "cell_type": "code", "source": [ "np.std(X, axis=-1)[0:5]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-11-17T13:48:59.467239Z", "start_time": "2024-11-17T13:48:59.458263Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[428.95224215],\n", " [483.35481095],\n", " [514.90052977],\n", " [629.00847763],\n", " [389.10059218]])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 4 }, { "cell_type": "markdown", "source": [ "We can rescale the time series in three ways:\n", "\n", "1. Normalise: subtract the mean and divide by the standard deviation to make all\n", "series have zero mean and unit variance." 
], "metadata": { "collapsed": false } }, { "cell_type": "code", "source": [ "from aeon.transformations.collection import Normalizer\n", "\n", "normalizer = Normalizer()\n", "X2 = normalizer.fit_transform(X)\n", "np.round(np.mean(X2, axis=-1)[0:5], 6)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-11-17T13:49:01.643630Z", "start_time": "2024-11-17T13:49:01.627083Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[ 0.],\n", " [-0.],\n", " [ 0.],\n", " [-0.],\n", " [-0.]])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 5 }, { "cell_type": "code", "source": [ "np.round(np.std(X2, axis=-1)[0:5], 6)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-11-17T13:49:02.670358Z", "start_time": "2024-11-17T13:49:02.648594Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[1.],\n", " [1.],\n", " [1.],\n", " [1.],\n", " [1.]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 6 }, { "cell_type": "markdown", "source": [ "2. Re-center: Recentering involves subtracting the mean of each series" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "source": [ "from aeon.transformations.collection import Centerer\n", "\n", "c = Centerer()\n", "X3 = c.fit_transform(X)\n", "np.round(np.mean(X3, axis=-1)[0:5], 6)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-11-17T13:49:04.345033Z", "start_time": "2024-11-17T13:49:04.332065Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[ 0.],\n", " [-0.],\n", " [ 0.],\n", " [-0.],\n", " [ 0.]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 7 }, { "cell_type": "markdown", "source": [ "3. 
Min-Max: Scale the data to be between 0 and 1" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "source": [ "from aeon.transformations.collection import MinMaxScaler\n", "\n", "minmax = MinMaxScaler()\n", "X4 = minmax.fit_transform(X)\n", "np.round(np.min(X4, axis=-1)[0:5], 6)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-11-17T13:49:06.135780Z", "start_time": "2024-11-17T13:49:06.116831Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[0.],\n", " [0.],\n", " [0.],\n", " [0.],\n", " [0.]])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 8 }, { "cell_type": "code", "source": [ "np.round(np.max(X4, axis=-1)[0:5], 6)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-11-17T13:49:07.094710Z", "start_time": "2024-11-17T13:49:07.072733Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[1.],\n", " [1.],\n", " [1.],\n", " [1.],\n", " [1.]])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 9 }, { "cell_type": "markdown", "source": [ "There is no single best way to rescale, although for count data such as this it is more\n", "common to min-max scale, so that the values can still be interpreted as proportions." ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resizing time series\n", "\n", "Suppose we have a collection of time series with different lengths, i.e. different\n", "numbers of time points. Currently, most of aeon's collection estimators\n", "(classification, clustering or regression) require equal-length time\n", "series. Those that can handle unequal length series are tagged with\n", "`\"capability:unequal_length\": True`." 
] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2020-12-19T14:31:58.456171Z", "iopub.status.busy": "2020-12-19T14:31:58.455565Z", "iopub.status.idle": "2020-12-19T14:31:59.189497Z", "shell.execute_reply": "2020-12-19T14:31:59.190005Z" }, "pycharm": { "is_executing": true }, "ExecuteTime": { "end_time": "2024-11-17T13:49:10.589292Z", "start_time": "2024-11-17T13:49:10.576923Z" } }, "source": [ "from aeon.classification.convolution_based import RocketClassifier\n", "from aeon.datasets import load_basic_motions, load_japanese_vowels, load_plaid\n", "from aeon.utils.validation import has_missing, is_equal_length, is_univariate" ], "outputs": [], "execution_count": 10 }, { "cell_type": "markdown", "source": [ "If you want to use an estimator that cannot internally handle unequal length series,\n", "one option is to convert the series to equal length. This can be done through\n", "padding, truncation, or resizing by fitting a function and resampling." ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unequal or equal length collections of time series\n", "\n", "If all series in a collection are equal length, the data is stored in a 3D\n", "numpy array of shape `(n_cases, n_channels, n_timepoints)`. 
If the series are unequal length, the collection is\n", "stored in a list of 2D numpy arrays:" ] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2020-12-19T14:31:59.194445Z", "iopub.status.busy": "2020-12-19T14:31:59.193903Z", "iopub.status.idle": "2020-12-19T14:32:01.019896Z", "shell.execute_reply": "2020-12-19T14:32:01.020463Z" }, "pycharm": { "is_executing": true }, "ExecuteTime": { "end_time": "2024-11-17T13:49:17.636972Z", "start_time": "2024-11-17T13:49:17.599882Z" } }, "source": [ "# Equal length multivariate data\n", "bm_X, bm_y = load_basic_motions()\n", "X = bm_X\n", "print(f\"{type(X)}, {X.shape}\")\n", "print(\n", " f\"univariate = {is_univariate(X)}, has missing ={has_missing(X)}, equal \"\n", " f\"length = {is_equal_length(X)}\"\n", ")" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'numpy.ndarray'>, (80, 6, 100)\n", "univariate = False, has missing =False, equal length = True\n" ] } ], "execution_count": 11 }, { "cell_type": "code", "source": [ "# Unequal length univariate data\n", "plaid_X, plaid_y = load_plaid()\n", "X = plaid_X\n", "print(type(plaid_X), \"\\n\", plaid_X[0].shape, \"\\n\", plaid_X[10].shape)\n", "print(\n", " f\"univariate = {is_univariate(X)}, has missing ={has_missing(X)}, equal \"\n", " f\"length = {is_equal_length(X)}\"\n", ")" ], "metadata": { "collapsed": false, "pycharm": { "is_executing": true }, "ExecuteTime": { "end_time": "2024-11-17T13:49:18.797171Z", "start_time": "2024-11-17T13:49:18.626506Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'list'> \n", " (1, 500) \n", " (1, 300)\n", "univariate = True, has missing =False, equal length = False\n" ] } ], "execution_count": 12 }, { "metadata": { "ExecuteTime": { "end_time": "2024-11-17T13:49:19.575389Z", "start_time": "2024-11-17T13:49:19.526275Z" } }, "cell_type": "code", "source": [ "# Unequal length multivariate data\n", "vowels_X, vowels_y = load_japanese_vowels(split=\"train\")\n", "X = vowels_X\n", "print(\n", " f\"univariate = {is_univariate(X)}, has missing 
={has_missing(X)}, equal \"\n", " f\"length = {is_equal_length(X)}\"\n", ")" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "univariate = False, has missing =False, equal length = False\n" ] } ], "execution_count": 13 }, { "cell_type": "markdown", "metadata": {}, "source": "\n" }, { "cell_type": "code", "source": [ "series_lengths = [array.shape[1] for array in plaid_X]\n", "\n", "# Find the shortest and longest series (the second dimension is n_timepoints)\n", "min_length = min(series_lengths)\n", "max_length = max(series_lengths)\n", "print(\" Min length = \", min_length, \" max length = \", max_length)" ], "metadata": { "collapsed": false, "pycharm": { "is_executing": true }, "ExecuteTime": { "end_time": "2024-11-17T13:49:21.081698Z", "start_time": "2024-11-17T13:49:21.061750Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Min length = 100 max length = 1344\n" ] } ], "execution_count": 14 }, { "metadata": {}, "cell_type": "markdown", "source": [ "There are two basic strategies for unequal length problems:\n", "\n", "1. Use an estimator that can internally handle unequal length series\n", "2. Transform the data to be equal length by, for example, truncating or padding series\n", "\n", "Estimators with the tag `\"capability:unequal_length\": True` have the capability to\n", "handle unequal length series. 
For classification, regression and\n", "clustering, the current list is:" ] }, { "metadata": { "ExecuteTime": { "end_time": "2024-11-17T13:49:23.280238Z", "start_time": "2024-11-17T13:49:23.143830Z" } }, "cell_type": "code", "source": [ "from aeon.utils.discovery import all_estimators\n", "\n", "all_estimators(\n", " type_filter=[\"classifier\", \"regressor\", \"clusterer\"],\n", " tag_filter={\"capability:unequal_length\": True},\n", ")" ], "outputs": [ { "data": { "text/plain": [ "[('Catch22Classifier',\n", " aeon.classification.feature_based._catch22.Catch22Classifier),\n", " ('Catch22Clusterer', aeon.clustering.feature_based._catch22.Catch22Clusterer),\n", " ('Catch22Regressor', aeon.regression.feature_based._catch22.Catch22Regressor),\n", " ('DummyClassifier', aeon.classification.dummy.DummyClassifier),\n", " ('DummyRegressor', aeon.regression._dummy.DummyRegressor),\n", " ('ElasticEnsemble',\n", " aeon.classification.distance_based._elastic_ensemble.ElasticEnsemble),\n", " ('KNeighborsTimeSeriesClassifier',\n", " aeon.classification.distance_based._time_series_neighbors.KNeighborsTimeSeriesClassifier),\n", " ('KNeighborsTimeSeriesRegressor',\n", " aeon.regression.distance_based._time_series_neighbors.KNeighborsTimeSeriesRegressor),\n", " ('RDSTClassifier', aeon.classification.shapelet_based._rdst.RDSTClassifier),\n", " ('RDSTRegressor', aeon.regression.shapelet_based._rdst.RDSTRegressor)]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 15 }, { "metadata": {}, "cell_type": "markdown", "source": "You can pass these estimators unequal length series and they will work as expected.\n" }, { "metadata": { "ExecuteTime": { "end_time": "2024-11-17T13:49:25.499271Z", "start_time": "2024-11-17T13:49:25.466359Z" } }, "cell_type": "code", "source": [ "from aeon.classification.distance_based import KNeighborsTimeSeriesClassifier\n", "\n", "knn = KNeighborsTimeSeriesClassifier()\n", "model = knn.fit(plaid_X, plaid_y)" 
], "outputs": [], "execution_count": 16 }, { "metadata": {}, "cell_type": "markdown", "source": [ "If time series are unequal length, collection estimators will raise an error if they\n", "do not have the capability to handle this characteristic. If you want to use them, \n", "you will need to preprocess the data to be equal length. " ] }, { "metadata": { "ExecuteTime": { "end_time": "2024-11-17T13:49:27.034532Z", "start_time": "2024-11-17T13:49:27.001467Z" } }, "cell_type": "code", "source": [ "rc = RocketClassifier()\n", "try:\n", " rc.fit(plaid_X, plaid_y)\n", "except ValueError as e:\n", " print(f\"ValueError: {e}\")" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ValueError: Data seen by instance of RocketClassifier has unequal length series, but RocketClassifier cannot handle unequal length series. \n" ] } ], "execution_count": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Padding, truncating or resizing.\n", "\n", "We can pad, truncate or resize. By default, pad adds zeros to make all series the\n", "length of the longest, truncate removes all values beyond the length of the shortest\n", "and resize stretches or shrinks the series." 
] }, { "cell_type": "code", "metadata": { "execution": { "iopub.execute_input": "2020-12-19T14:32:01.245270Z", "iopub.status.busy": "2020-12-19T14:32:01.244733Z", "iopub.status.idle": "2020-12-19T14:32:02.911970Z", "shell.execute_reply": "2020-12-19T14:32:02.912833Z" }, "pycharm": { "is_executing": true }, "ExecuteTime": { "end_time": "2024-11-17T13:49:29.582165Z", "start_time": "2024-11-17T13:49:29.437476Z" } }, "source": [ "from aeon.transformations.collection import Padder, Resizer, Truncator\n", "\n", "pad = Padder()\n", "truncate = Truncator()\n", "resize = Resizer(length=600)\n", "X2 = pad.fit_transform(plaid_X)\n", "X3 = truncate.fit_transform(plaid_X)\n", "X4 = resize.fit_transform(plaid_X)\n", "print(X2.shape, \"\\n\", X3.shape, \"\\n\", X4.shape)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1074, 1, 1344) \n", " (1074, 1, 100) \n", " (1074, 1, 600)\n" ] } ], "execution_count": 18 }, { "metadata": { "ExecuteTime": { "end_time": "2024-11-17T13:49:36.601769Z", "start_time": "2024-11-17T13:49:35.625172Z" } }, "cell_type": "code", "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.title(\"Before and after padding: PLAID first case (shifted up for unpadded)\")\n", "plt.plot(plaid_X[0][0] + 10)\n", "plt.plot(X2[0][0])" ], "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "
" ], "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "execution_count": 19 }, { "cell_type": "markdown", "source": [ "You can put these transformers in a pipeline so that the same preprocessing is applied\n", "to both the train and test splits.\n" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "from aeon.pipeline import make_pipeline\n", "\n", "# Unequal length univariate data, using the predefined train/test split\n", "train_X, train_y = load_plaid(split=\"Train\")\n", "test_X, test_y = load_plaid(split=\"Test\")\n", "steps = [truncate, rc]\n", "pipe = make_pipeline(steps)\n", "pipe.fit(train_X, train_y)\n", "preds = pipe.predict(test_X)\n", "# Score the test predictions against the test labels\n", "accuracy_score(test_y, preds)" ], "metadata": { "collapsed": false, "pycharm": { "is_executing": true }, "ExecuteTime": { "end_time": "2024-11-17T13:50:05.966304Z", "start_time": "2024-11-17T13:49:37.145088Z" } }, "outputs": [], "execution_count": 20 }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Missing Values\n", "\n", "Missing values are indicated by `NaN` in numpy arrays. 
You can test whether any `aeon`\n", "data structure contains missing values using the utility function `has_missing`:" ] }, { "metadata": { "ExecuteTime": { "end_time": "2024-11-17T13:50:06.076875Z", "start_time": "2024-11-17T13:50:06.065907Z" } }, "cell_type": "code", "source": [ "X = np.random.random(size=(10, 2, 200))\n", "has_missing(X)" ], "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 21 }, { "metadata": { "ExecuteTime": { "end_time": "2024-11-17T13:50:06.186109Z", "start_time": "2024-11-17T13:50:06.180126Z" } }, "cell_type": "code", "source": [ "# np.nan is the canonical spelling; the np.NAN alias was removed in NumPy 2.0\n", "X[5][0][55] = np.nan\n", "has_missing(X)" ], "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 22 }, { "metadata": {}, "cell_type": "markdown", "source": [ "There is a range of strategies for handling missing values. These include:\n", "\n", "1. Use an estimator that internally handles missing values. It is fairly easy for\n", "some algorithms (such as decision trees) to internally deal with missing values,\n", "usually by treating a missing value as a distinct series value after discretisation.\n", "We do not yet have many estimators with this capability. Estimators that are able to\n", "internally handle missing values are tagged with `\"capability:missing_values\": True`." 
] }, { "metadata": { "ExecuteTime": { "end_time": "2024-11-17T13:50:06.436013Z", "start_time": "2024-11-17T13:50:06.296331Z" } }, "cell_type": "code", "source": [ "from aeon.utils.discovery import all_estimators\n", "\n", "all_estimators(\n", " tag_filter={\"capability:missing_values\": True},\n", ")" ], "outputs": [ { "data": { "text/plain": [ "[('BORF', aeon.transformations.collection.dictionary_based._borf.BORF),\n", " ('CollectionId',\n", " aeon.transformations.collection.compose._identity.CollectionId),\n", " ('DummyClassifier', aeon.classification.dummy.DummyClassifier),\n", " ('DummyRegressor', aeon.regression._dummy.DummyRegressor),\n", " ('RandomSegmenter', aeon.segmentation._random.RandomSegmenter),\n", " ('STRAY', aeon.anomaly_detection._stray.STRAY),\n", " ('SimpleImputer', aeon.transformations.collection._impute.SimpleImputer)]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 23 }, { "metadata": {}, "cell_type": "markdown", "source": [ "2. Removing series with missing values: this is often desirable if the train set size is\n", "large, the number of series with missing values is small and the proportion of missing\n", "values for these series is high.\n", "\n", "We do not yet have a transformer for this, but it is easy to implement yourself.\n", "\n", "3. Imputing missing values from the series: estimating the missing values from the\n", "other values in a time series is commonly done. This is\n", " often desirable if the train set size is small and the proportion of missing values\n", " is low. You can do this with the transformer ``SimpleImputer``. This imputes\n", " each series and each channel independently. For example, mean imputation\n", " of a series with two channels `[[NaN,1.0,2.0,3.0],[-1.0,-2.0,-3.0,-4.0]]` would give\n", " `[[2.0,1.0,2.0,3.0],[-1.0,-2.0,-3.0,-4.0]]`. 
" ] }, { "metadata": { "ExecuteTime": { "end_time": "2024-11-17T13:52:23.133162Z", "start_time": "2024-11-17T13:52:23.118202Z" } }, "cell_type": "code", "source": [ "from aeon.transformations.collection import SimpleImputer\n", "\n", "imput = SimpleImputer(strategy=\"mean\")\n", "X2 = imput.fit_transform(X)\n", "has_missing(X2)" ], "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 26 }, { "metadata": { "ExecuteTime": { "end_time": "2024-11-17T13:52:23.825897Z", "start_time": "2024-11-17T13:52:23.811936Z" } }, "cell_type": "code", "source": [ "imp2 = SimpleImputer(strategy=\"median\")\n", "X3 = imp2.fit_transform(X)\n", "has_missing(X3)" ], "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 27 }, { "metadata": { "ExecuteTime": { "end_time": "2024-11-17T13:52:24.602058Z", "start_time": "2024-11-17T13:52:24.582111Z" } }, "cell_type": "code", "source": [ "imp3 = SimpleImputer(strategy=\"constant\", fill_value=0)\n", "X4 = imp3.fit_transform(X)\n", "has_missing(X4)" ], "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 28 }, { "metadata": {}, "cell_type": "code", "outputs": [], "execution_count": null, "source": "" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }