{ "metadata": { "name": "08.1_unsupervised_dimreduction" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Unsupervised Learning: Dimensionality Reduction and Visualization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unsupervised learning is interested in situations in which X is available, but not y: data without labels.\n", "\n", "A typical use case is to find hiden structure in the data.\n", "\n", "Previously we worked on visualizing the iris data by plotting\n", "pairs of dimensions by trial and error, until we arrived at\n", "the best pair of dimensions for our dataset. Here we will\n", "use an unsupervised *dimensionality reduction* algorithm\n", "to accomplish this more automatically." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By the end of this section you will\n", "\n", "- Know how to instantiate and train an unsupervised dimensionality reduction algorithm:\n", " Principal Component Analysis (PCA)\n", "- Know how to use PCA to visualize high-dimensional data" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Dimensionality Reduction: PCA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dimensionality reduction is the task of deriving a set of new\n", "artificial features that is smaller than the original feature\n", "set while retaining most of the variance of the original data.\n", "Here we'll use a common but powerful dimensionality reduction\n", "technique called Principal Component Analysis (PCA).\n", "We'll perform PCA on the iris dataset that we saw before:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PCA is performed using linear combinations of the original features\n", "using a truncated Singular Value Decomposition of the matrix X so\n", "as to project the data onto a base of the top singular vectors.\n", "If the number of retained components is 2 or 3, PCA can be used\n", "to visualize the dataset." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=2, whiten=True)\n", "pca.fit(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once fitted, the pca model exposes the singular vectors in the components_ attribute:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pca.components_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other attributes are available as well:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pca.explained_variance_ratio_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pca.explained_variance_ratio_.sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us project the iris dataset along those first two dimensions:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X_pca = pca.transform(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PCA `normalizes` and `whitens` the data, which means that the data\n", "is now centered on both components with unit variance:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "np.round(X_pca.mean(axis=0), decimals=5)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "np.round(X_pca.std(axis=0), decimals=5)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Furthermore, the samples components do no longer carry any linear correlation:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.corrcoef(X_pca.T)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visualize the projection using pylab, but first\n", "let's make sure our ipython notebook is in pylab inline mode" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can visualize the results using the following utility function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import matplotlib.pyplot as plt\n", "from itertools import cycle\n", "\n", "def plot_PCA_2D(data, target, target_names):\n", " colors = cycle('rgbcmykw')\n", " target_ids = range(len(target_names))\n", " plt.figure()\n", " for i, c, label in zip(target_ids, colors, target_names):\n", " plt.scatter(data[target == i, 0], data[target == i, 1],\n", " c=c, label=label)\n", " plt.legend()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now calling this function for our data, we see the plot:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "plot_PCA_2D(X_pca, iris.target, iris.target_names)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this projection was determined *without* any information about the\n", "labels (represented by the colors): this is the sense in which the learning\n", "is **unsupervised**. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Note also that the default implementation of PCA computes the\n", "singular value decomposition (SVD) of the full\n", "data matrix, which is not scalable when both ``n_samples`` and\n", "``n_features`` are big (more than a few thousand).\n", "If you are interested in a number of components that is much\n", "smaller than both ``n_samples`` and ``n_features``, consider using\n", "``sklearn.decomposition.RandomizedPCA`` instead." ] },
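{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sketch of that drop-in replacement (purely illustrative here:\n", "iris is far too small to need the randomized solver):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import RandomizedPCA\n", "\n", "# a sketch: same estimator API as PCA, but with a randomized SVD solver\n", "rpca = RandomizedPCA(n_components=2, random_state=0)\n", "X_rpca = rpca.fit_transform(X)\n", "X_rpca.shape" ], "language": "python", "metadata": {}, "outputs": [] },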
{ "cell_type": "markdown", "metadata": {}, "source": [ "Other dimensionality reduction techniques which are useful to know about:\n", "\n", "- [sklearn.decomposition.PCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.PCA.html):\n", "  Principal Component Analysis\n", "- [sklearn.decomposition.RandomizedPCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.RandomizedPCA.html):\n", "  fast non-exact PCA implementation based on a randomized algorithm\n", "- [sklearn.decomposition.SparsePCA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.SparsePCA.html):\n", "  PCA variant including L1 penalty for sparsity\n", "- [sklearn.decomposition.FastICA](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.FastICA.html):\n", "  Independent Component Analysis\n", "- [sklearn.decomposition.NMF](http://scikit-learn.org/0.13/modules/generated/sklearn.decomposition.NMF.html):\n", "  non-negative matrix factorization\n", "- [sklearn.manifold.LocallyLinearEmbedding](http://scikit-learn.org/0.13/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html):\n", "  nonlinear manifold learning technique based on local neighborhood geometry\n", "- [sklearn.manifold.Isomap](http://scikit-learn.org/0.13/modules/generated/sklearn.manifold.Isomap.html):\n", "  nonlinear manifold learning technique based on a sparse graph algorithm" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Manifold Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One weakness of PCA is that it cannot detect non-linear features. A family\n", "of algorithms known as *Manifold Learning* has been developed to address\n", "this deficiency. A canonical dataset used in manifold learning is the\n", "*S-curve*, which we briefly saw in an earlier section:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import make_s_curve\n", "X, y = make_s_curve(n_samples=1000)\n", "\n", "from mpl_toolkits.mplot3d import Axes3D\n", "ax = plt.axes(projection='3d')\n", "\n", "ax.scatter3D(X[:, 0], X[:, 1], X[:, 2], c=y)\n", "ax.view_init(10, -60)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a 2-dimensional dataset embedded in three dimensions, but it is embedded\n", "in such a way that PCA cannot discover the underlying data orientation:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X_pca = PCA(n_components=2).fit_transform(X)\n", "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The manifold learning algorithms available in the ``sklearn.manifold``\n", "submodule, however, are able to recover the underlying 2-dimensional manifold:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.manifold import LocallyLinearEmbedding, Isomap\n", "lle = LocallyLinearEmbedding(n_neighbors=15, n_components=2, method='modified')\n", "X_lle = lle.fit_transform(X)\n", "plt.scatter(X_lle[:, 0], X_lle[:, 1], c=y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "iso = Isomap(n_neighbors=15, n_components=2)\n", "X_iso = iso.fit_transform(X)\n", "plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Exercise: Dimensionality reduction of digits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply PCA, LocallyLinearEmbedding, and Isomap to project the digits data to two dimensions.\n", "Which visualization technique separates the classes most cleanly?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "# ..." ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Solution:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load solutions/08A_digits_projection.py" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }