{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Manifold learning on handwritten digits: Locally Linear Embedding, Isomap...\n\nWe illustrate various embedding techniques on the digits dataset.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load digits dataset\nWe will load the digits dataset and only use six first of the ten available classes.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n\ndigits = load_digits(n_class=6)\nX, y = digits.data, digits.target\nn_samples, n_features = X.shape\nn_neighbors = 30" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can plot the first hundred digits from this data set.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfig, axs = plt.subplots(nrows=10, ncols=10, figsize=(6, 6))\nfor idx, ax in enumerate(axs.ravel()):\n ax.imshow(X[idx].reshape((8, 8)), cmap=plt.cm.binary)\n ax.axis(\"off\")\n_ = fig.suptitle(\"A selection from the 64-dimensional digits dataset\", fontsize=16)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Helper function to plot embedding\nBelow, we will use different techniques to embed the digits dataset. We will plot\nthe projection of the original data onto each embedding. It will allow us to\ncheck whether or digits are grouped together in the embedding space, or\nscattered across it.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nfrom matplotlib import offsetbox\n\nfrom sklearn.preprocessing import MinMaxScaler\n\n\ndef plot_embedding(X, title):\n _, ax = plt.subplots()\n X = MinMaxScaler().fit_transform(X)\n\n for digit in digits.target_names:\n ax.scatter(\n *X[y == digit].T,\n marker=f\"${digit}$\",\n s=60,\n color=plt.cm.Dark2(digit),\n alpha=0.425,\n zorder=2,\n )\n shown_images = np.array([[1.0, 1.0]]) # just something big\n for i in range(X.shape[0]):\n # plot every digit on the embedding\n # show an annotation box for a group of digits\n dist = np.sum((X[i] - shown_images) ** 2, 1)\n if np.min(dist) < 4e-3:\n # don't show points that are too close\n continue\n shown_images = np.concatenate([shown_images, [X[i]]], axis=0)\n imagebox = offsetbox.AnnotationBbox(\n offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r), X[i]\n )\n imagebox.set(zorder=1)\n ax.add_artist(imagebox)\n\n ax.set_title(title)\n ax.axis(\"off\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Embedding techniques comparison\n\nBelow, we compare different techniques. However, there are a couple of things\nto note:\n\n* the :class:`~sklearn.ensemble.RandomTreesEmbedding` is not\n technically a manifold embedding method, as it learn a high-dimensional\n representation on which we apply a dimensionality reduction method.\n However, it is often useful to cast a dataset into a representation in\n which the classes are linearly-separable.\n* the :class:`~sklearn.discriminant_analysis.LinearDiscriminantAnalysis` and\n the :class:`~sklearn.neighbors.NeighborhoodComponentsAnalysis`, are supervised\n dimensionality reduction method, i.e. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Embedding techniques comparison\n\nBelow, we compare different techniques. However, there are a couple of things\nto note:\n\n* the :class:`~sklearn.ensemble.RandomTreesEmbedding` is not\n  technically a manifold embedding method, as it learns a high-dimensional\n  representation on which we apply a dimensionality reduction method.\n  However, it is often useful to cast a dataset into a representation in\n  which the classes are linearly separable.\n* the :class:`~sklearn.discriminant_analysis.LinearDiscriminantAnalysis` and\n  the :class:`~sklearn.neighbors.NeighborhoodComponentsAnalysis` are supervised\n  dimensionality reduction methods, i.e. they make use of the provided labels,\n  contrary to the other methods.\n* the :class:`~sklearn.manifold.TSNE` is initialized with the embedding that is\n  generated by PCA in this example. This ensures global stability of the embedding,\n  i.e., the embedding does not depend on random initialization.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.decomposition import TruncatedSVD\nfrom sklearn.discriminant_analysis import LinearDiscriminantAnalysis\nfrom sklearn.ensemble import RandomTreesEmbedding\nfrom sklearn.manifold import (\n    MDS,\n    TSNE,\n    Isomap,\n    LocallyLinearEmbedding,\n    SpectralEmbedding,\n)\nfrom sklearn.neighbors import NeighborhoodComponentsAnalysis\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.random_projection import SparseRandomProjection\n\nembeddings = {\n    \"Random projection embedding\": SparseRandomProjection(\n        n_components=2, random_state=42\n    ),\n    \"Truncated SVD embedding\": TruncatedSVD(n_components=2),\n    \"Linear Discriminant Analysis embedding\": LinearDiscriminantAnalysis(\n        n_components=2\n    ),\n    \"Isomap embedding\": Isomap(n_neighbors=n_neighbors, n_components=2),\n    \"Standard LLE embedding\": LocallyLinearEmbedding(\n        n_neighbors=n_neighbors, n_components=2, method=\"standard\"\n    ),\n    \"Modified LLE embedding\": LocallyLinearEmbedding(\n        n_neighbors=n_neighbors, n_components=2, method=\"modified\"\n    ),\n    \"Hessian LLE embedding\": LocallyLinearEmbedding(\n        n_neighbors=n_neighbors, n_components=2, method=\"hessian\"\n    ),\n    \"LTSA LLE embedding\": LocallyLinearEmbedding(\n        n_neighbors=n_neighbors, n_components=2, method=\"ltsa\"\n    ),\n    \"MDS embedding\": MDS(n_components=2, n_init=1, max_iter=120, n_jobs=2),\n    \"Random Trees embedding\": make_pipeline(\n        RandomTreesEmbedding(n_estimators=200, max_depth=5, random_state=0),\n        TruncatedSVD(n_components=2),\n    ),\n    \"Spectral embedding\": SpectralEmbedding(\n        n_components=2, random_state=0, eigen_solver=\"arpack\"\n    ),\n    \"t-SNE embedding\": TSNE(\n        n_components=2,\n        max_iter=500,\n        n_iter_without_progress=150,\n        n_jobs=2,\n        random_state=0,\n    ),\n    \"NCA embedding\": NeighborhoodComponentsAnalysis(\n        n_components=2, init=\"pca\", random_state=0\n    ),\n}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have declared all the methods of interest, we can run and perform the\nprojection of the original data. We will store the projected data as well as the\ncomputational time needed to perform each projection.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from time import time\n\nprojections, timing = {}, {}\nfor name, transformer in embeddings.items():\n    if name.startswith(\"Linear Discriminant Analysis\"):\n        data = X.copy()\n        # add a small perturbation so the covariance estimate is non-singular\n        data.flat[:: X.shape[1] + 1] += 0.01\n    else:\n        data = X\n\n    print(f\"Computing {name}...\")\n    start_time = time()\n    projections[name] = transformer.fit_transform(data, y)\n    timing[name] = time() - start_time" ] },
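{ "cell_type": "markdown", "metadata": {}, "source": [ "Before looking at the plots, we can also compare the projections quantitatively.\nThe sketch below computes the :func:`~sklearn.manifold.trustworthiness` of each\nprojection, which measures how well local neighborhoods of the original\n64-dimensional space are preserved in the 2D embedding (1.0 is best). Reusing\n``n_neighbors`` for the score is an arbitrary choice made for this sketch.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.manifold import trustworthiness\n\n# A score close to 1.0 means that samples that are close in the original space\n# stay close in the embedding; supervised methods may trade this off for\n# better class separation.\nfor name, projection in projections.items():\n    score = trustworthiness(X, projection, n_neighbors=n_neighbors)\n    print(f\"{name}: trustworthiness = {score:.2f}\")" ] },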
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can plot the resulting projection given by each method.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for name in timing:\n    title = f\"{name} (time {timing[name]:.3f}s)\"\n    plot_embedding(projections[name], title)\n\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }