{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Approximate nearest neighbors in TSNE\n\nThis example demonstrates how to chain KNeighborsTransformer and TSNE in a\npipeline.\nIt also shows how to wrap the packages `nmslib` and `pynndescent` to replace\nKNeighborsTransformer and perform approximate nearest neighbors. These packages\ncan be installed with `pip install nmslib pynndescent`.\n\nNote: In KNeighborsTransformer we use the definition that includes each\ntraining point as its own neighbor in the count of `n_neighbors`, and for\ncompatibility reasons, one extra neighbor is computed when `mode == 'distance'`.\nPlease note that we do the same in the proposed `nmslib` wrapper.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we try to import the packages and warn the user if they are\nmissing.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import sys\n\ntry:\n    import nmslib\nexcept ImportError:\n    print(\"The package 'nmslib' is required to run this example.\")\n    sys.exit()\n\ntry:\n    from pynndescent import PyNNDescentTransformer\nexcept ImportError:\n    print(\"The package 'pynndescent' is required to run this example.\")\n    sys.exit()" ] },
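{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the neighbor-counting convention above concrete, here is a minimal\nillustration on a tiny random dataset: with `mode='distance'`,\nKNeighborsTransformer stores `n_neighbors + 1` entries per row of the\nreturned sparse graph, including each sample itself at distance zero.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.neighbors import KNeighborsTransformer\n\nrng = np.random.RandomState(0)\nX_demo = rng.randn(6, 2)\n\nknt = KNeighborsTransformer(n_neighbors=2, mode=\"distance\")\ngraph = knt.fit_transform(X_demo)\n\n# Each row stores n_neighbors + 1 entries: the sample itself (at distance 0)\n# plus its 2 nearest neighbors.\nprint(graph.shape)  # (6, 6)\nprint(np.diff(graph.indptr))  # [3 3 3 3 3 3]" ] },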
{ "cell_type": "markdown", "metadata": {}, "source": [ "We define a wrapper class implementing the scikit-learn API for `nmslib`, as\nwell as a loading function.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import joblib\nimport numpy as np\nfrom scipy.sparse import csr_matrix\n\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.utils import shuffle\n\n\nclass NMSlibTransformer(TransformerMixin, BaseEstimator):\n    \"\"\"Wrapper for using nmslib as sklearn's KNeighborsTransformer\"\"\"\n\n    def __init__(self, n_neighbors=5, metric=\"euclidean\", method=\"sw-graph\", n_jobs=-1):\n        self.n_neighbors = n_neighbors\n        self.method = method\n        self.metric = metric\n        self.n_jobs = n_jobs\n\n    def fit(self, X):\n        self.n_samples_fit_ = X.shape[0]\n\n        # see more metrics in the manual\n        # https://github.com/nmslib/nmslib/tree/master/manual\n        space = {\n            \"euclidean\": \"l2\",\n            \"cosine\": \"cosinesimil\",\n            \"l1\": \"l1\",\n            \"l2\": \"l2\",\n        }[self.metric]\n\n        self.nmslib_ = nmslib.init(method=self.method, space=space)\n        self.nmslib_.addDataPointBatch(X.copy())\n        self.nmslib_.createIndex()\n        return self\n\n    def transform(self, X):\n        n_samples_transform = X.shape[0]\n\n        # For compatibility reasons, as each sample is considered as its own\n        # neighbor, one extra neighbor will be computed.\n        n_neighbors = self.n_neighbors + 1\n\n        if self.n_jobs < 0:\n            # Same handling as done in joblib for negative values of n_jobs:\n            # in particular, `n_jobs == -1` means \"as many threads as CPUs\".\n            num_threads = joblib.cpu_count() + self.n_jobs + 1\n        else:\n            num_threads = self.n_jobs\n\n        results = self.nmslib_.knnQueryBatch(\n            X.copy(), k=n_neighbors, num_threads=num_threads\n        )\n        indices, distances = zip(*results)\n        indices, distances = np.vstack(indices), np.vstack(distances)\n\n        indptr = np.arange(0, n_samples_transform * n_neighbors + 1, n_neighbors)\n        kneighbors_graph = csr_matrix(\n            (distances.ravel(), indices.ravel(), indptr),\n            shape=(n_samples_transform, self.n_samples_fit_),\n        )\n\n        return kneighbors_graph\n\n\ndef load_mnist(n_samples):\n    \"\"\"Load MNIST, shuffle the data, and return only n_samples.\"\"\"\n    mnist = fetch_openml(\"mnist_784\", as_frame=False)\n    X, y = shuffle(mnist.data, mnist.target, random_state=2)\n    return X[:n_samples] / 255, y[:n_samples]" ] },
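{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check of the wrapper (a minimal check, not a rigorous\nvalidation), we can compare its output to the exact\n:class:`~sklearn.neighbors.KNeighborsTransformer` on a small random dataset:\nboth should return a CSR graph of the same shape with `n_neighbors + 1`\nstored entries per row.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsTransformer\n\nrng = np.random.RandomState(42)\nX_check = rng.randn(100, 20)\n\nexact_graph = KNeighborsTransformer(n_neighbors=5, mode=\"distance\").fit_transform(\n    X_check\n)\napprox_graph = NMSlibTransformer(n_neighbors=5).fit_transform(X_check)\n\n# Both graphs are (n_samples, n_samples) CSR matrices storing 6 entries per\n# row: the 5 requested neighbors plus the sample itself.\nprint(exact_graph.shape == approx_graph.shape)  # True\nprint(np.diff(approx_graph.indptr))  # [6 6 ... 6]" ] },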
{ "cell_type": "markdown", "metadata": {}, "source": [ "We benchmark the different exact/approximate nearest neighbors transformers.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\n\nfrom sklearn.manifold import TSNE\nfrom sklearn.neighbors import KNeighborsTransformer\nfrom sklearn.pipeline import make_pipeline\n\ndatasets = [\n    (\"MNIST_10000\", load_mnist(n_samples=10_000)),\n    (\"MNIST_20000\", load_mnist(n_samples=20_000)),\n]\n\nn_iter = 500\nperplexity = 30\nmetric = \"euclidean\"\n# TSNE requires a certain number of neighbors which depends on the\n# perplexity parameter.\n# Add one since we include each sample as its own neighbor.\nn_neighbors = int(3.0 * perplexity + 1) + 1\n\ntsne_params = dict(\n    init=\"random\",  # pca not supported for sparse matrices\n    perplexity=perplexity,\n    method=\"barnes_hut\",\n    random_state=42,\n    n_iter=n_iter,\n    learning_rate=\"auto\",\n)\n\ntransformers = [\n    (\n        \"KNeighborsTransformer\",\n        KNeighborsTransformer(n_neighbors=n_neighbors, mode=\"distance\", metric=metric),\n    ),\n    (\n        \"NMSlibTransformer\",\n        NMSlibTransformer(n_neighbors=n_neighbors, metric=metric),\n    ),\n    (\n        \"PyNNDescentTransformer\",\n        PyNNDescentTransformer(\n            n_neighbors=n_neighbors, metric=metric, parallel_batch_queries=True\n        ),\n    ),\n]\n\n# Width of the name column for aligned printing.\nlongest = np.max([len(name) for name, model in transformers])\n\nfor dataset_name, (X, y) in datasets:\n    msg = f\"Benchmarking on {dataset_name}:\"\n    print(f\"\\n{msg}\\n\" + \"-\" * len(msg))\n\n    for transformer_name, transformer in transformers:\n        start = time.time()\n        transformer.fit(X)\n        fit_duration = time.time() - start\n        print(f\"{transformer_name:<{longest}} {fit_duration:.3f} sec (fit)\")\n        start = time.time()\n        Xt = transformer.transform(X)\n        transform_duration = time.time() - start\n        print(f\"{transformer_name:<{longest}} {transform_duration:.3f} sec (transform)\")\n        if transformer_name == \"PyNNDescentTransformer\":\n            # Transform a second time to measure the warm (post-JIT) runtime.\n            start = time.time()\n            Xt = transformer.transform(X)\n            transform_duration = time.time() - start\n            print(\n                f\"{transformer_name:<{longest}} {transform_duration:.3f} sec\"\n                \" (transform)\"\n            )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample output::\n\n    Benchmarking on MNIST_10000:\n    ----------------------------\n    KNeighborsTransformer  0.007 sec (fit)\n    KNeighborsTransformer  1.139 sec (transform)\n    NMSlibTransformer      0.208 sec (fit)\n    NMSlibTransformer      0.315 sec (transform)\n    PyNNDescentTransformer 4.823 sec (fit)\n    PyNNDescentTransformer 4.884 sec (transform)\n    PyNNDescentTransformer 0.744 sec (transform)\n\n    Benchmarking on MNIST_20000:\n    ----------------------------\n    KNeighborsTransformer  0.011 sec (fit)\n    KNeighborsTransformer  5.769 sec (transform)\n    NMSlibTransformer      0.733 sec (fit)\n    NMSlibTransformer      1.077 sec (transform)\n    PyNNDescentTransformer 14.448 sec (fit)\n    PyNNDescentTransformer 7.103 sec (transform)\n    PyNNDescentTransformer 1.759 sec (transform)\n\nNotice that `PyNNDescentTransformer` takes more time during the first `fit`\nand the first `transform` due to the overhead of the numba just-in-time\ncompiler. After the first call, the compiled code is cached by numba and\nsubsequent calls do not suffer from this initial overhead.\nBoth :class:`~sklearn.neighbors.KNeighborsTransformer` and `NMSlibTransformer`\nare only run once here because their `fit` and `transform` times are stable\n(they do not have the cold-start problem of `PyNNDescentTransformer`).\n\n" ] },
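{ "cell_type": "markdown", "metadata": {}, "source": [ "We can now benchmark the full :class:`~sklearn.manifold.TSNE` pipelines: the\ndefault estimator with its internal nearest neighbors search, and pipelines\nthat feed it a precomputed sparse distance graph produced by the exact and\napproximate transformers. The resulting embeddings should be very similar\nacross methods.\n\n" ] },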
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nfrom matplotlib.ticker import NullFormatter\n\ntransformers = [\n    (\"TSNE with internal NearestNeighbors\", TSNE(metric=metric, **tsne_params)),\n    (\n        \"TSNE with KNeighborsTransformer\",\n        make_pipeline(\n            KNeighborsTransformer(\n                n_neighbors=n_neighbors, mode=\"distance\", metric=metric\n            ),\n            TSNE(metric=\"precomputed\", **tsne_params),\n        ),\n    ),\n    (\n        \"TSNE with NMSlibTransformer\",\n        make_pipeline(\n            NMSlibTransformer(n_neighbors=n_neighbors, metric=metric),\n            TSNE(metric=\"precomputed\", **tsne_params),\n        ),\n    ),\n]\n\n# init the plot\nnrows = len(datasets)\nncols = np.sum([1 for name, model in transformers if \"TSNE\" in name])\nfig, axes = plt.subplots(\n    nrows=nrows, ncols=ncols, squeeze=False, figsize=(5 * ncols, 4 * nrows)\n)\naxes = axes.ravel()\ni_ax = 0\n\n# Width of the name column for aligned printing.\nlongest = np.max([len(name) for name, model in transformers])\n\nfor dataset_name, (X, y) in datasets:\n    msg = f\"Benchmarking on {dataset_name}:\"\n    print(f\"\\n{msg}\\n\" + \"-\" * len(msg))\n\n    for transformer_name, transformer in transformers:\n        start = time.time()\n        Xt = transformer.fit_transform(X)\n        transform_duration = time.time() - start\n        print(\n            f\"{transformer_name:<{longest}} {transform_duration:.3f} sec\"\n            \" (fit_transform)\"\n        )\n\n        # plot TSNE embedding which should be very similar across methods\n        axes[i_ax].set_title(transformer_name + \"\\non \" + dataset_name)\n        axes[i_ax].scatter(\n            Xt[:, 0],\n            Xt[:, 1],\n            c=y.astype(np.int32),\n            alpha=0.2,\n            cmap=plt.cm.viridis,\n        )\n        axes[i_ax].xaxis.set_major_formatter(NullFormatter())\n        axes[i_ax].yaxis.set_major_formatter(NullFormatter())\n        axes[i_ax].axis(\"tight\")\n        i_ax += 1\n\nfig.tight_layout()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample output::\n\n    Benchmarking on MNIST_10000:\n    ----------------------------\n    TSNE with internal NearestNeighbors 24.828 sec (fit_transform)\n    TSNE with KNeighborsTransformer     20.111 sec (fit_transform)\n    TSNE with NMSlibTransformer         21.757 sec (fit_transform)\n\n    Benchmarking on MNIST_20000:\n    ----------------------------\n    TSNE with internal NearestNeighbors 51.955 sec (fit_transform)\n    TSNE with KNeighborsTransformer     50.994 sec (fit_transform)\n    TSNE with NMSlibTransformer         43.536 sec (fit_transform)\n\nWe can observe that the default :class:`~sklearn.manifold.TSNE` estimator with\nits internal :class:`~sklearn.neighbors.NearestNeighbors` implementation is\nroughly equivalent to the pipeline with :class:`~sklearn.manifold.TSNE` and\n:class:`~sklearn.neighbors.KNeighborsTransformer` in terms of performance.\nThis is expected because both pipelines rely internally on the same\n:class:`~sklearn.neighbors.NearestNeighbors` implementation that performs\nan exact neighbors search. The approximate `NMSlibTransformer` is already\nslightly faster than the exact search on the smallest dataset, but this speed\ndifference is expected to become more significant on datasets with a larger\nnumber of samples.\n\nNotice however that not all approximate search methods are guaranteed to\nimprove the speed of the default exact search method: indeed, the exact search\nimplementation has improved significantly since scikit-learn 1.1. Furthermore,\nthe brute-force exact search method does not require building an index at\n`fit` time. So, to get an overall performance improvement in the context of\nthe :class:`~sklearn.manifold.TSNE` pipeline, the gains of the approximate\nsearch at `transform` time need to be larger than the extra time spent\nbuilding the approximate search index at `fit` time.\n\nFinally, the TSNE algorithm itself is also computationally intensive,\nirrespective of the nearest neighbors search. So speeding up the nearest\nneighbors search step by a factor of 5 would not result in a factor of 5\nspeed-up of the overall pipeline.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }