{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "> This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8.7. Reducing the dimensionality of a data with a Principal Component Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. We import NumPy, matplotlib, and scikit-learn." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import sklearn\n", "import sklearn.decomposition as dec\n", "import sklearn.datasets as ds\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. The Iris flower dataset is available in the *datasets* module of scikit-learn." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "iris = ds.load_iris()\n", "X = iris.data\n", "y = iris.target\n", "print(X.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Each row contains four parameters related to the morphology of the flower. Let's display the first two components in two dimensions. The color reflects the iris variety of the flower (the label, between 0 and 2)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=(6,3));\n", "plt.scatter(X[:,0], X[:,1], c=y,\n", " s=30, cmap=plt.cm.rainbow);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. We now apply PCA on the dataset to get the transformed matrix. This operation can be done in a single line with scikit-learn: we instantiate a `PCA` model, and call the `fit_transform` method. This function computes the principal components first, and projects the data then." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_bis = dec.PCA().fit_transform(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. We now display the same dataset, but in a new coordinate system (or equivalently, a linearly transformed version of the initial dataset)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=(6,3));\n", "plt.scatter(X_bis[:,0], X_bis[:,1], c=y,\n", " s=30, cmap=plt.cm.rainbow);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Points belonging to the same classes are now grouped together, even though the `PCA` estimator dit *not* use the labels. The PCA was able to find a projection maximizing the variance, which corresponds here to a projection where the classes are well separated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. The `scikit.decomposition` module contains several variants of the classic `PCA` estimator: `ProbabilisticPCA`, `SparsePCA`, `RandomizedPCA`, `KernelPCA`... As an example, let's take a look at `KernelPCA`, a non-linear version of PCA." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_ter = dec.KernelPCA(kernel='rbf').fit_transform(X)\n", "plt.figure(figsize=(6,3));\n", "plt.scatter(X_ter[:,0], X_ter[:,1], c=y, s=30, cmap=plt.cm.rainbow);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).\n", "\n", "> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.2" } }, "nbformat": 4, "nbformat_minor": 0 }