{ "cells": [ { "cell_type": "markdown", "source": [ "# Using aeon distances with scikit learn\n", "\n", "Scikit learn has a range of algorithms based on distances, including classifiers,\n", "regressors and clusterers. These can generally all be used with aeon distances\n", "in two ways.\n", "\n", "1. Pass the distance function as a callable to the `metric` or `kernel` parameter\n", "in the constructor or\n", "2. Set the `metric` or `kernel` to precomputed in the constructor and pass a\n", "pairwise distance matrix to `fit` and `predict`." ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "## K-Nearest Neighbour Classification and Regression in sklearn.neighbors" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 1, "outputs": [], "source": [ "from sklearn.neighbors import (\n", " KNeighborsClassifier,\n", " KNeighborsRegressor,\n", " KNeighborsTransformer,\n", " NearestCentroid,\n", ")" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "If we have a univariate problem stored as a 2D numpy shape\n", "`(n_cases,n_timepoints)`, we can apply these estimators directly,\n", "but it is treating the data as vector based rather than time series.\n", "\n", "If we try and use with an aeon style 3D numpy\n", "`(n_cases, n_channels, n_timepoints)`, they will crash. See the\n", "[data_formats](../datasets/data_structures.ipynb) for details on data storage." ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 2, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "KNN with manhatttan distance on 2D time series data first five ['1' '2' '2' '1' '1']\n", "Shape of train = (50, 1, 150) sklearn will crash\n", "raises an ValueError: Found array with dim 3.KNeighborsClassifier expected <= 2.\n" ] } ], "source": [ "from aeon.datasets import load_gunpoint\n", "\n", "trainx1, trainy1 = load_gunpoint(split=\"TRAIN\", return_type=\"numpy2D\")\n", "testx1, testy1 = load_gunpoint(split=\"TEST\", return_type=\"numpy2D\")\n", "# Use directly on TSC data with standard scikit distances\n", "knn = KNeighborsClassifier(metric=\"manhattan\")\n", "knn.fit(trainx1, trainy1)\n", "print(\n", " \"KNN with manhatttan distance on 2D time series data first five \",\n", " knn.predict(testx1)[:5],\n", ")\n", "trainx2, trainy2 = load_gunpoint(split=\"TRAIN\")\n", "testx2, testy2 = load_gunpoint(split=\"TEST\")\n", "print(\"Shape of train = \", trainx2.shape, \"sklearn will crash\")\n", "try:\n", " knn.fit(trainx2, trainy2)\n", "except ValueError:\n", " print(\n", " \"raises an ValueError: Found array with dim 3.\"\n", " \"KNeighborsClassifier expected <= 2.\"\n", " )" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "We can use `KNeighborsClassifier` with a callable distance function with an aeon\n", "distance function, but the input must still be 2D numpy. We can also use them as\n", "callables in other `sklearn.neighbors` estimators" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 3, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "KNN with DTW on 2D time series data first five predictions= ['1' '2' '2' '1' '2']\n", "raises an ValueError: Found array with dim 3. KNeighborsClassifier expected <= 2.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Code\\aeon\\venv\\lib\\site-packages\\sklearn\\neighbors\\_nearest_centroid.py:179: UserWarning: Averaging for metrics other than euclidean and manhattan not supported. The average is set to be the mean.\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "nc with ERP on 2D time series data first five predictions= ['1' '1' '2' '1' '1']\n", "nc with ERP on 2D time series data transform shape = (150, 50)\n" ] } ], "source": [ "from aeon.distances import (\n", " dtw_distance,\n", " edr_distance,\n", " erp_distance,\n", " msm_distance,\n", " twe_distance,\n", ")\n", "\n", "# Use directly on TSC data with aeon distance function\n", "knn = KNeighborsClassifier(metric=dtw_distance)\n", "knn.fit(trainx1, trainy1)\n", "print(\n", " \"KNN with DTW on 2D time series data first five predictions= \",\n", " knn.predict(testx1)[:5],\n", ")\n", "try:\n", " knn.fit(trainx2, trainy2)\n", "except ValueError:\n", " print(\n", " \"raises an ValueError: Found array with dim 3. \"\n", " \"KNeighborsClassifier expected <= 2.\"\n", " )\n", "nc = NearestCentroid(metric=erp_distance)\n", "nc.fit(trainx1, trainy1)\n", "print(\n", " \"nc with ERP on 2D time series data first five predictions= \",\n", " nc.predict(testx1)[:5],\n", ")\n", "kt = KNeighborsTransformer(metric=edr_distance)\n", "kt.fit(trainx1, trainy1)\n", "print(\n", " \"nc with ERP on 2D time series data transform shape = \", kt.transform(testx1).shape\n", ")" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Also note that the callable will not work with some KNN `algorithm` options such as\n", "`kd_tree`, which raises an error `kd_tree does not support callable metric`. Because of\n", "these problems, we have implemented a KNN classifier and regressor to use with our\n", "distance functions." ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 4, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "KNN classification with MSM 3D time series data first five predictions= ['1' '2' '2' '1' '1']\n", "aeon KNN regression with TWE first five predictions= [0.02558824 0.00877193 0.01960784 0.03533314 0.00805611]\n", "sklearn KNN regression with TWE first five predictions= [0.02558824 0.00877193 0.01960784 0.03533314 0.00805611]\n" ] } ], "source": [ "from aeon.classification.distance_based import KNeighborsTimeSeriesClassifier\n", "from aeon.datasets import load_covid_3month # Regression problem\n", "from aeon.regression.distance_based import KNeighborsTimeSeriesRegressor\n", "\n", "knn1 = KNeighborsTimeSeriesClassifier(distance=\"msm\")\n", "knn1.fit(trainx1, trainy1)\n", "print(\n", " \"KNN classification with MSM 3D time series data first five predictions=\",\n", " knn1.predict(testx1)[:5],\n", ")\n", "trainx3, trainy3 = load_covid_3month(split=\"train\")\n", "testx3, testy3 = load_covid_3month(split=\"test\")\n", "knn2 = KNeighborsTimeSeriesRegressor(distance=\"twe\", n_neighbors=1)\n", "knn2.fit(trainx3, trainy3)\n", "print(\n", " \"aeon KNN regression with TWE first five predictions=\",\n", " knn2.predict(testx3)[:5],\n", ")\n", "knr = KNeighborsRegressor(metric=twe_distance, n_neighbors=1)\n", "trainx4 = trainx3.squeeze()\n", "testx4 = testx3.squeeze()\n", "knr.fit(trainx4, trainy3)\n", "print(\n", " \"sklearn KNN regression with TWE first five predictions=\",\n", " knr.predict(testx4)[:5],\n", ")" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Another alternative is to pass the distance measures as precomputed.\n", "Note that this requires the calculation of the full distance matricies,\n", "and still cannot be used with `algorithm` options." ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 5, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "KNN precomputed= ['1' '2' '2' '1' '1']\n" ] } ], "source": [ "from aeon.distances import euclidean_pairwise_distance\n", "\n", "train_dists = euclidean_pairwise_distance(trainx2)\n", "test_dists = euclidean_pairwise_distance(testx2, trainx2)\n", "knn = KNeighborsClassifier(metric=\"precomputed\")\n", "knn.fit(train_dists, trainy2)\n", "print(\"KNN precomputed=\", knn.predict(test_dists)[:5])" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "## Support Vector Machine Classification and Regression in sklearn.svm" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 6, "outputs": [], "source": [ "from sklearn.svm import SVC, SVR, NuSVC, NuSVR" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "The SVM estimators in scikit can be used with pairwise distance matrices.\n", "Please note that not all elastic distance functions are kernels, and it\n", "is desirable that they are for SVM. DTW is not a metric, but MSM and TWE\n", "are." ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 7, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SVC with TWE first five predictions= ['1' '2' '1' '2' '2']\n", "NuSVC with MSM first five predictions= ['1' '2' '2' '1' '2']\n", "SVR with DTW first five predictions= [0.08823529 0.08823529 0.08823529 0.08823529 0.08823529]\n" ] } ], "source": [ "from aeon.distances import (\n", " dtw_pairwise_distance,\n", " msm_pairwise_distance,\n", " twe_distance,\n", " twe_pairwise_distance,\n", ")\n", "\n", "svc = SVC(kernel=\"precomputed\")\n", "svr = SVR(kernel=\"precomputed\")\n", "nsvc = NuSVC(kernel=\"precomputed\")\n", "nsvr = NuSVR(kernel=twe_distance)\n", "train_m1 = twe_pairwise_distance(trainx1)\n", "test_m1 = twe_pairwise_distance(testx1, trainx1)\n", "svc.fit(train_m1, trainy1)\n", "print(\"SVC with TWE first five predictions= \", svc.predict(test_m1)[:5])\n", "train_m2 = msm_pairwise_distance(trainx2)\n", "test_m2 = msm_pairwise_distance(testx2, trainx2)\n", "nsvc.fit(train_m2, trainy2)\n", "print(\"NuSVC with MSM first five predictions= \", nsvc.predict(test_m2)[:5])\n", "train_m3 = dtw_pairwise_distance(trainx3)\n", "test_m3 = dtw_pairwise_distance(testx3, trainx3)\n", "svr.fit(train_m3, trainy3)\n", "print(\"SVR with DTW first five predictions= \", svr.predict(test_m3)[:5])" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "# Clustering with sklearn.cluster" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 8, "outputs": [], "source": [ "from sklearn.cluster import DBSCAN" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Some sklearn clustering algorithms accept callable distances or precomputed distance\n", "matrices, and these can be used with aeon distance functions.\n", "\n", "Note that DBSCAN can only make predictions on the train data, so it has no predict\n", "function.\n" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 13, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-1 0 0 0 0]\n", "[-1 0 0 0 0]\n", "[-1 0 0 0 0]\n" ] } ], "source": [ "db1 = DBSCAN(eps=2.5)\n", "preds1 = db1.fit_predict(trainx1)\n", "print(preds1[:5])\n", "db2 = DBSCAN(metric=msm_distance, eps=2.5)\n", "db3 = DBSCAN(metric=\"precomputed\", eps=2.5)\n", "preds2 = db2.fit_predict(trainx1)\n", "print(preds1[:5])\n", "preds3 = db3.fit_predict(train_m2)\n", "print(preds1[:5])" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "You can use pairwise distance functions within the scikit learn `FunctionTransformer`\n", " wrapper" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 11, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Shape = (5, 5)\n", "These should be the same 10.479268600222673 and 10.479268600222673\n" ] } ], "source": [ "from sklearn.preprocessing import FunctionTransformer\n", "\n", "from aeon.datasets import make_example_3d_numpy\n", "from aeon.distances import msm_distance, msm_pairwise_distance\n", "\n", "X = make_example_3d_numpy(n_cases=5, n_channels=1, n_timepoints=10)\n", "ft = FunctionTransformer(msm_pairwise_distance)\n", "X2 = ft.transform(X)\n", "print(\" Shape = \", X2.shape)\n", "d = msm_distance(X[0], X[1])\n", "print(f\"These should be the same {d} and {X2[0][1]}\")" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "This makes it easy to use distances as features in an a scikit-learn pipeline.\n", "Whether it is a good idea to do this is a separate question!\n" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 12, "outputs": [ { "data": { "text/plain": "array([1, 1, 1, 1, 1])" }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.pipeline import Pipeline\n", "\n", "X, y = make_example_3d_numpy(n_cases=5, n_channels=1, n_timepoints=10, return_y=True)\n", "X2 = make_example_3d_numpy(n_cases=5, n_channels=1, n_timepoints=10)\n", "\n", "pipe = Pipeline(steps=[(\"FunctionTransformer\", ft), (\"SVM\", SVC())])\n", "pipe.fit(X, y)\n", "pipe.predict(X2)" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [], "metadata": { "collapsed": false } } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }