{ "metadata": { "name": "07_iris_clustering" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Clustering of Iris Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clustering is the task of gathering samples into groups of similar\n", "samples according to some predefined similarity or dissimilarity\n", "measure (such as the Euclidean distance).\n", "In this section we will explore a basic clustering task on the\n", "iris data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By the end of this section you will:\n", "\n", "- Know how to instantiate and train KMeans, an unsupervised clustering algorithm\n", "- Know several other interesting clustering algorithms within scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's re-use the results of the 2D PCA of the iris dataset in order to\n", "explore clustering. First we need to repeat some of the code from the\n", "previous notebook:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# make sure ipython inline mode is activated\n", "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# all of this is copied from the previous notebook, '06_iris_dimensionality'\n", "from sklearn.datasets import load_iris\n", "from sklearn.decomposition import PCA\n", "import pylab as pl\n", "from itertools import cycle\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target\n", "\n", "pca = PCA(n_components=2, whiten=True).fit(X)\n", "X_pca = pca.transform(X)\n", "\n", "# scatter-plot 2D data, using one color per target class\n", "def plot_2D(data, target, target_names):\n", "    colors = cycle('rgbcmykw')\n", "    target_ids = range(len(target_names))\n", "    pl.figure()\n", "    for i, c, label in zip(target_ids, colors, target_names):\n", "        pl.scatter(data[target == i, 0], data[target == i, 1],\n", "                   c=c, label=label)\n", "    pl.legend()" ], "language": "python", 
"metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To remind ourselves what we're looking at, let's again plot the PCA components\n", "we defined in the last notebook:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "plot_2D(X_pca, iris.target, iris.target_names)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will use one of the simplest clustering algorithms, K-means.\n", "This is an iterative algorithm which searches for a fixed number of\n", "cluster centers (here, three) such that the total squared distance from\n", "each point to its assigned cluster center is minimized.\n", "First, let's step back for a second,\n", "look at the above plot, and think about what this will do.\n", "The algorithm will look for three cluster centers, and label the\n", "points according to which cluster center they're closest to.\n", "\n", "**Question:** what would you expect the output to look like?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cluster import KMeans\n", "from numpy.random import RandomState\n", "rng = RandomState(42)\n", "\n", "kmeans = KMeans(n_clusters=3, random_state=rng)\n", "kmeans.fit(X_pca)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "np.round(kmeans.cluster_centers_, decimals=2)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``labels_`` attribute of the K-means estimator contains the ID of the\n", "cluster that each point is assigned to." ] }, { "cell_type": "code", "collapsed": false, "input": [ "kmeans.labels_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The K-means algorithm has been used to infer cluster labels for the\n", "points. 
Let's call the ``plot_2D`` function again, but color the points\n", "based on the cluster labels rather than the iris species." ] }, { "cell_type": "code", "collapsed": false, "input": [ "plot_2D(X_pca, kmeans.labels_, [\"c0\", \"c1\", \"c2\"])\n", "pl.title('K-Means labels')\n", "\n", "plot_2D(X_pca, iris.target, iris.target_names)\n", "pl.title('True labels')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Some Notable Clustering Routines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following are three well-known clustering algorithms. Like most unsupervised learning\n", "models in scikit-learn, they expect the data to be clustered to have the shape `(n_samples, n_features)`:\n", "\n", "- `sklearn.cluster.KMeans`:
 \n", " A simple yet effective clustering algorithm. The number of clusters must be\n", " provided in advance, and the input data is assumed to be normalized\n", " (a PCA model can be used as a preprocessor).\n", "- `sklearn.cluster.MeanShift`:
 \n", " Can find better-looking clusters than KMeans, but does not scale to a large number of samples.\n", "- `sklearn.cluster.DBSCAN`:
 \n", " Can detect irregularly shaped clusters based on density, i.e. sparse regions in\n", " the input space are likely to become inter-cluster boundaries. Can also detect\n", " outliers (samples that are not part of a cluster).\n", "\n", "Other clustering algorithms can work directly with a precomputed affinity matrix of shape\n", "`(n_samples, n_samples)` instead of a data array of shape `(n_samples, n_features)`:\n", "\n", "- `sklearn.cluster.AffinityPropagation`:
\n", " Clustering algorithm based on message passing between data points.\n", "- `sklearn.cluster.SpectralClustering`:
\n", " KMeans applied to a projection of the normalized graph Laplacian: finds\n", " normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.\n", "- `sklearn.cluster.Ward`:
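 \n", " Ward implements hierarchical clustering based on the Ward algorithm,\n", " a variance-minimizing approach. At each merge step, it minimizes the increase in the sum of\n", " squared differences within all clusters (inertia criterion).\n", " In later scikit-learn releases this class was replaced by\n", " `AgglomerativeClustering` with `linkage='ward'`.\n", "- `sklearn.cluster.DBSCAN`: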
\n", " DBSCAN can work with either an array of samples or an affinity matrix." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Some Applications of Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some common applications of clustering algorithms:\n", "\n", "- Building customer profiles for market analysis\n", "- Grouping related web news (e.g. Google News) and web search results\n", "- Grouping related stock quotes for investment portfolio management\n", "- Can be used as a preprocessing step for recommender systems\n", "- Can be used to build a code book of prototype samples for unsupervised feature extraction for supervised learning algorithms\n" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perform the K-Means cluster search again, but this time learn the\n", "clusters using the full data matrix ``X``, rather than the projected\n", "matrix ``X_pca``.\n", "\n", "Does this change the results?\n", "\n", "Plot the results (you can still use X_pca for visualization, but plot\n", "the labels derived from the full 4-D set).\n", "Do the 4D K-means labels look closer to the true labels?" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }