{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Unsupervised Learning: Dimensionality Reduction and Visualization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unsupervised learning is interested in situations in which X is available, but not y: data without labels. A typical use case is to find hiden structure in the data." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Dimensionality Reduction: PCA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dimensionality reduction is the task of deriving a set of new\n", "artificial features that is smaller than the original feature\n", "set while retaining most of the variance of the original data.\n", "Here we'll use a common but powerful dimensionality reduction\n", "technique called Principal Component Analysis (PCA).\n", "We'll perform PCA on the iris dataset that we saw before:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PCA is performed using linear combinations of the original features\n", "using a truncated Singular Value Decomposition of the matrix X so\n", "as to project the data onto a base of the top singular vectors.\n", "If the number of retained components is 2 or 3, PCA can be used\n", "to visualize the dataset." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=2, whiten=True)\n", "pca.fit(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once fitted, the pca model exposes the singular vectors in the components_ attribute:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pca.components_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other attributes are available as well:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pca.explained_variance_ratio_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pca.explained_variance_ratio_.sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us project the iris dataset along those first two dimensions:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X_pca = pca.transform(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PCA `normalizes` and `whitens` the data, which means that the data\n", "is now centered on both components with unit variance:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X_pca.mean(axis=0)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X_pca.std(axis=0)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Furthermore, the samples components do no longer carry any linear correlation:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "np.corrcoef(X_pca.T)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "We can visualize the projection using pylab" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "target_ids = range(len(iris.target_names))\n", "plt.figure()\n", "for i, c, label in zip(target_ids, 'rgbcmykw', iris.target_names):\n", " plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],\n", " c=c, label=label)\n", "plt.legend()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this projection was determined *without* any information about the\n", "labels (represented by the colors): this is the sense in which the learning\n", "is **unsupervised**. Nevertheless, we see that the projection gives us insight\n", "into the distribution of the different flowers in parameter space: notably,\n", "*iris setosa* is much more distinct than the other two species." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note also that the default implementation of PCA computes the\n", "singular value decomposition (SVD) of the full\n", "data matrix, which is not scalable when both ``n_samples`` and\n", "``n_features`` are big (more that a few thousands).\n", "If you are interested in a number of components that is much\n", "smaller than both ``n_samples`` and ``n_features``, consider using\n", "`sklearn.decomposition.RandomizedPCA` instead." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Manifold Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One weakness of PCA is that it cannot detect non-linear features. A set\n", "of algorithms known as *Manifold Learning* have been developed to address\n", "this deficiency. A canonical dataset used in Manifold learning is the\n", "*S-curve*, which we briefly saw in an earlier section:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import make_s_curve\n", "X, y = make_s_curve(n_samples=1000)\n", "\n", "from mpl_toolkits.mplot3d import Axes3D\n", "ax = plt.axes(projection='3d')\n", "\n", "ax.scatter3D(X[:, 0], X[:, 1], X[:, 2], c=y)\n", "ax.view_init(10, -60)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a 2-dimensional dataset embedded in three dimensions, but it is embedded\n", "in such a way that PCA cannot discover the underlying data orientation:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X_pca = PCA(n_components=2).fit_transform(X)\n", "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Manifold learning algorithms, however, available in the ``sklearn.manifold``\n", "submodule, are able to recover the underlying 2-dimensional manifold:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.manifold import LocallyLinearEmbedding, Isomap\n", "lle = LocallyLinearEmbedding(n_neighbors=15, n_components=2, method='modified')\n", "X_lle = lle.fit_transform(X)\n", "plt.scatter(X_lle[:, 0], X_lle[:, 1], c=y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "iso = Isomap(n_neighbors=15, n_components=2)\n", "X_iso = iso.fit_transform(X)\n", "plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Exercise: Dimension reduction of digits" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply PCA, LocallyLinearEmbedding, and Isomap to project the data to two dimensions.\n", "Which visualization technique separates the classes most cleanly?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "# ..." ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Solution:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load solutions/08A_digits_projection.py" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }