{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Unsupervised Learning Part 2 -- Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clustering is the task of gathering samples into groups of similar\n", "samples according to some predefined similarity or distance (dissimilarity)\n", "measure, such as the Euclidean distance.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we will explore a basic clustering task on some synthetic and real-world datasets.\n", "\n", "Here are some common applications of clustering algorithms:\n", "\n", "- Compression for data reduction\n", "- Summarizing data as a reprocessing step for recommender systems\n", "- Similarly:\n", " - grouping related web news (e.g. Google News) and web search results\n", " - grouping related stock quotes for investment portfolio management\n", " - building customer profiles for market analysis\n", "- Building a code book of prototype samples for unsupervised feature extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by creating a simple, 2-dimensional, synthetic dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\n", "\n", "X, y = make_blobs(random_state=42)\n", "X.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(8, 8))\n", "plt.scatter(X[:, 0], X[:, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the scatter plot above, we can see three separate groups of data points and we would like to recover them using clustering -- think of \"discovering\" the class labels that we already take for granted in a classification task.\n", "\n", "Even if the groups are obvious in the data, it is hard to find them when the data lives in a high-dimensional space, which we can't visualize in a single histogram or scatterplot." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will use one of the simplest clustering algorithms, K-means.\n", "This is an iterative algorithm which searches for three cluster\n", "centers such that the distance from each point to its cluster is\n", "minimized. The standard implementation of K-means uses the Euclidean distance, which is why we want to make sure that all our variables are measured on the same scale if we are working with real-world datastets. In the previous notebook, we talked about one technique to achieve this, namely, standardization.\n", "\n", "
\n", "
\n", " Question:\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "\n", "kmeans = KMeans(n_clusters=3, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get the cluster labels either by calling fit and then accessing the \n", "``labels_`` attribute of the K means estimator, or by calling ``fit_predict``.\n", "Either way, the result contains the ID of the cluster that each point is assigned to." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "labels = kmeans.fit_predict(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.all(y == labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the assignments that have been found" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(8, 8))\n", "plt.scatter(X[:, 0], X[:, 1], c=labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compared to the true labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(8, 8))\n", "plt.scatter(X[:, 0], X[:, 1], c=y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we are probably satisfied with the clustering results. But in general we might want to have a more quantitative evaluation. How about comparing our cluster labels with the ground truth we got when generating the blobs?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix, accuracy_score\n", "\n", "print('Accuracy score:', accuracy_score(y, labels))\n", "print(confusion_matrix(y, labels))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.mean(y == labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE:\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though we recovered the partitioning of the data into clusters perfectly, the cluster IDs we assigned were arbitrary,\n", "and we can not hope to recover them. Therefore, we must use a different scoring metric, such as ``adjusted_rand_score``, which is invariant to permutations of the labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import adjusted_rand_score\n", "\n", "adjusted_rand_score(y, labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the \"short-comings\" of K-means is that we have to specify the number of clusters, which we often don't know *apriori*. For example, let's have a look what happens if we set the number of clusters to 2 in our synthetic 3-blob dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kmeans = KMeans(n_clusters=2, random_state=42)\n", "labels = kmeans.fit_predict(X)\n", "plt.figure(figsize=(8, 8))\n", "plt.scatter(X[:, 0], X[:, 1], c=labels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kmeans.cluster_centers_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The Elbow Method\n", "\n", "The Elbow method is a \"rule-of-thumb\" approach to finding the optimal number of clusters. Here, we look at the cluster dispersion for different values of k:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "distortions = []\n", "for i in range(1, 11):\n", " km = KMeans(n_clusters=i, \n", " random_state=0)\n", " km.fit(X)\n", " distortions.append(km.inertia_)\n", "\n", "plt.plot(range(1, 11), distortions, marker='o')\n", "plt.xlabel('Number of clusters')\n", "plt.ylabel('Distortion')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we pick the value that resembles the \"pit of an elbow.\" As we can see, this would be k=3 in this case, which makes sense given our visual expection of the dataset previously." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Clustering comes with assumptions**: A clustering algorithm finds clusters by making assumptions with samples should be grouped together. Each algorithm makes different assumptions and the quality and interpretability of your results will depend on whether the assumptions are satisfied for your goal. For K-means clustering, the model is that all clusters have equal, spherical variance.\n", "\n", "**In general, there is no guarantee that structure found by a clustering algorithm has anything to do with what you were interested in**.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can easily create a dataset that has non-isotropic clusters, on which kmeans will fail:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12, 12))\n", "\n", "n_samples = 1500\n", "random_state = 170\n", "X, y = make_blobs(n_samples=n_samples, random_state=random_state)\n", "\n", "# Incorrect number of clusters\n", "y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)\n", "\n", "plt.subplot(221)\n", "plt.scatter(X[:, 0], X[:, 1], c=y_pred)\n", "plt.title(\"Incorrect Number of Blobs\")\n", "\n", "# Anisotropicly distributed data\n", "transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]\n", "X_aniso = np.dot(X, transformation)\n", "y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)\n", "\n", "plt.subplot(222)\n", "plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)\n", "plt.title(\"Anisotropicly Distributed Blobs\")\n", "\n", "# Different variance\n", "X_varied, y_varied = make_blobs(n_samples=n_samples,\n", " cluster_std=[1.0, 2.5, 0.5],\n", " random_state=random_state)\n", "y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)\n", "\n", "plt.subplot(223)\n", "plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)\n", "plt.title(\"Unequal Variance\")\n", "\n", "# Unevenly sized blobs\n", "X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))\n", "y_pred = KMeans(n_clusters=3,\n", " random_state=random_state).fit_predict(X_filtered)\n", "\n", "plt.subplot(224)\n", "plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)\n", "plt.title(\"Unevenly Sized Blobs\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some Notable Clustering Routines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following are two well-known clustering algorithms. \n", "\n", "- `sklearn.cluster.KMeans`:
\n", " The simplest, yet effective clustering algorithm. Needs to be provided with the\n", " number of clusters in advance, and assumes that the data is normalized as input\n", " (but use a PCA model as preprocessor).\n", "- `sklearn.cluster.MeanShift`:
\n", " Can find better looking clusters than KMeans but is not scalable to high number of samples.\n", "- `sklearn.cluster.DBSCAN`:
\n", " Can detect irregularly shaped clusters based on density, i.e. sparse regions in\n", " the input space are likely to become inter-cluster boundaries. Can also detect\n", " outliers (samples that are not part of a cluster).\n", "- `sklearn.cluster.AffinityPropagation`:
\n", " Clustering algorithm based on message passing between data points.\n", "- `sklearn.cluster.SpectralClustering`:
\n", " KMeans applied to a projection of the normalized graph Laplacian: finds\n", " normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.\n", "- `sklearn.cluster.Ward`:
\n", " Ward implements hierarchical clustering based on the Ward algorithm,\n", " a variance-minimizing approach. At each step, it minimizes the sum of\n", " squared differences within all clusters (inertia criterion).\n", "\n", "Of these, Ward, SpectralClustering, DBSCAN and Affinity propagation can also work with precomputed similarity matrices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE: digits clustering:\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "# ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# %load solutions/08B_digits_clustering.py" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }