{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Unsupervised Learning: Dimensionality Reduction and Visualization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unsupervised learning is interested in situations in which X is available, but not y: data without labels. A typical use case is to find hiden structure in the data." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Dimensionality Reduction: PCA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dimensionality reduction is the task of deriving a set of new\n", "artificial features that is smaller than the original feature\n", "set while retaining most of the variance of the original data.\n", "Here we'll use a common but powerful dimensionality reduction\n", "technique called Principal Component Analysis (PCA).\n", "We'll perform PCA on the iris dataset that we saw before:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PCA is performed using linear combinations of the original features\n", "using a truncated Singular Value Decomposition of the matrix X so\n", "as to project the data onto a base of the top singular vectors.\n", "If the number of retained components is 2 or 3, PCA can be used\n", "to visualize the dataset." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=2, whiten=True)\n", "pca.fit(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once fitted, the pca model exposes the singular vectors in the components_ attribute:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pca.components_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other attributes are available as well:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pca.explained_variance_ratio_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pca.explained_variance_ratio_.sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us project the iris dataset along those first two dimensions:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X_pca = pca.transform(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PCA `normalizes` and `whitens` the data, which means that the data\n", "is now centered on both components with unit variance:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X_pca.mean(axis=0)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X_pca.std(axis=0)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Furthermore, the samples components do no longer carry any linear correlation:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "np.corrcoef(X_pca.T)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "We can visualize the projection using pylab" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "target_ids = range(len(iris.target_names))\n", "plt.figure()\n", "for i, c, label in zip(target_ids, 'rgbcmykw', iris.target_names):\n", " plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],\n", " c=c, label=label)\n", "plt.legend()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this projection was determined *without* any information about the\n", "labels (represented by the colors): this is the sense in which the learning\n", "is **unsupervised**. Nevertheless, we see that the projection gives us insight\n", "into the distribution of the different flowers in parameter space: notably,\n", "*iris setosa* is much more distinct than the other two species." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note also that the default implementation of PCA computes the\n", "singular value decomposition (SVD) of the full\n", "data matrix, which is not scalable when both ``n_samples`` and\n", "``n_features`` are big (more that a few thousands).\n", "If you are interested in a number of components that is much\n", "smaller than both ``n_samples`` and ``n_features``, consider using\n", "`sklearn.decomposition.RandomizedPCA` instead." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Manifold Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One weakness of PCA is that it cannot detect non-linear features. A set\n", "of algorithms known as *Manifold Learning* have been developed to address\n", "this deficiency. A canonical dataset used in Manifold learning is the\n", "*S-curve*, which we briefly saw in an earlier section:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import make_s_curve\n", "X, y = make_s_curve(n_samples=1000)\n", "\n", "from mpl_toolkits.mplot3d import Axes3D\n", "ax = plt.axes(projection='3d')\n", "\n", "ax.scatter3D(X[:, 0], X[:, 1], X[:, 2], c=y)\n", "ax.view_init(10, -60)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a 2-dimensional dataset embedded in three dimensions, but it is embedded\n", "in such a way that PCA cannot discover the underlying data orientation:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X_pca = PCA(n_components=2).fit_transform(X)\n", "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Manifold learning algorithms, however, available in the ``sklearn.manifold``\n", "submodule, are able to recover the underlying 2-dimensional manifold:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.manifold import LocallyLinearEmbedding, Isomap\n", "lle = LocallyLinearEmbedding(n_neighbors=15, n_components=2, method='modified')\n", "X_lle = lle.fit_transform(X)\n", "plt.scatter(X_lle[:, 0], X_lle[:, 1], c=y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "iso = Isomap(n_neighbors=15, n_components=2)\n", "X_iso = iso.fit_transform(X)\n", "plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Exercise: Dimension reduction of digits" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply PCA, LocallyLinearEmbedding, and Isomap to project the data to two dimensions.\n", "Which visualization technique separates the classes most cleanly?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "# ..." ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Solution:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load solutions/08A_digits_projection.py" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }