{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Demonstration of k-means assumptions\n\nThis example is meant to illustrate situations where k-means produces\nunintuitive and possibly undesirable clusters.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data generation\n\nThe function :func:`~sklearn.datasets.make_blobs` generates isotropic\n(spherical) gaussian blobs. To obtain anisotropic (elliptical) gaussian blobs\none has to define a linear `transformation`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.datasets import make_blobs\n\nn_samples = 1500\nrandom_state = 170\ntransformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]\n\nX, y = make_blobs(n_samples=n_samples, random_state=random_state)\nX_aniso = np.dot(X, transformation) # Anisotropic blobs\nX_varied, y_varied = make_blobs(\n n_samples=n_samples, cluster_std=[1.0, 2.5, 0.5], random_state=random_state\n) # Unequal variance\nX_filtered = np.vstack(\n (X[y == 0][:500], X[y == 1][:100], X[y == 2][:10])\n) # Unevenly sized blobs\ny_filtered = [0] * 500 + [1] * 100 + [2] * 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visualize the resulting data:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))\n\naxs[0, 0].scatter(X[:, 0], X[:, 1], c=y)\naxs[0, 0].set_title(\"Mixture of Gaussian Blobs\")\n\naxs[0, 1].scatter(X_aniso[:, 0], X_aniso[:, 1], c=y)\naxs[0, 1].set_title(\"Anisotropically Distributed Blobs\")\n\naxs[1, 0].scatter(X_varied[:, 0], X_varied[:, 1], c=y_varied)\naxs[1, 0].set_title(\"Unequal Variance\")\n\naxs[1, 1].scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_filtered)\naxs[1, 1].set_title(\"Unevenly Sized Blobs\")\n\nplt.suptitle(\"Ground truth clusters\").set_y(0.95)\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit models and plot results\n\nThe previously generated data is now used to show how\n:class:`~sklearn.cluster.KMeans` behaves in the following scenarios:\n\n- Non-optimal number of clusters: in a real setting there is no uniquely\n defined **true** number of clusters. An appropriate number of clusters has\n to be decided from data-based criteria and knowledge of the intended goal.\n- Anisotropically distributed blobs: k-means consists of minimizing sample's\n euclidean distances to the centroid of the cluster they are assigned to. As\n a consequence, k-means is more appropriate for clusters that are isotropic\n and normally distributed (i.e. spherical gaussians).\n- Unequal variance: k-means is equivalent to taking the maximum likelihood\n estimator for a \"mixture\" of k gaussian distributions with the same\n variances but with possibly different means.\n- Unevenly sized blobs: there is no theoretical result about k-means that\n states that it requires similar cluster sizes to perform well, yet\n minimizing euclidean distances does mean that the more sparse and\n high-dimensional the problem is, the higher is the need to run the algorithm\n with different centroid seeds to ensure a global minimal inertia.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n\ncommon_params = {\n \"n_init\": \"auto\",\n \"random_state\": random_state,\n}\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))\n\ny_pred = KMeans(n_clusters=2, **common_params).fit_predict(X)\naxs[0, 0].scatter(X[:, 0], X[:, 1], c=y_pred)\naxs[0, 0].set_title(\"Non-optimal Number of Clusters\")\n\ny_pred = KMeans(n_clusters=3, **common_params).fit_predict(X_aniso)\naxs[0, 1].scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)\naxs[0, 1].set_title(\"Anisotropically Distributed Blobs\")\n\ny_pred = KMeans(n_clusters=3, **common_params).fit_predict(X_varied)\naxs[1, 0].scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)\naxs[1, 0].set_title(\"Unequal Variance\")\n\ny_pred = KMeans(n_clusters=3, **common_params).fit_predict(X_filtered)\naxs[1, 1].scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)\naxs[1, 1].set_title(\"Unevenly Sized Blobs\")\n\nplt.suptitle(\"Unexpected KMeans clusters\").set_y(0.95)\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Possible solutions\n\nFor an example on how to find a correct number of blobs, see\n`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py`.\nIn this case it suffices to set `n_clusters=3`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_pred = KMeans(n_clusters=3, **common_params).fit_predict(X)\nplt.scatter(X[:, 0], X[:, 1], c=y_pred)\nplt.title(\"Optimal Number of Clusters\")\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To deal with unevenly sized blobs one can increase the number of random\ninitializations. In this case we set `n_init=10` to avoid finding a\nsub-optimal local minimum. For more details see `kmeans_sparse_high_dim`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_pred = KMeans(n_clusters=3, n_init=10, random_state=random_state).fit_predict(\n X_filtered\n)\nplt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)\nplt.title(\"Unevenly Sized Blobs \\nwith several initializations\")\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As anisotropic and unequal variances are real limitations of the k-means\nalgorithm, here we propose instead the use of\n:class:`~sklearn.mixture.GaussianMixture`, which also assumes gaussian\nclusters but does not impose any constraints on their variances. Notice that\none still has to find the correct number of blobs (see\n`sphx_glr_auto_examples_mixture_plot_gmm_selection.py`).\n\nFor an example on how other clustering methods deal with anisotropic or\nunequal variance blobs, see the example\n`sphx_glr_auto_examples_cluster_plot_cluster_comparison.py`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.mixture import GaussianMixture\n\nfig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))\n\ny_pred = GaussianMixture(n_components=3).fit_predict(X_aniso)\nax1.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)\nax1.set_title(\"Anisotropically Distributed Blobs\")\n\ny_pred = GaussianMixture(n_components=3).fit_predict(X_varied)\nax2.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)\nax2.set_title(\"Unequal Variance\")\n\nplt.suptitle(\"Gaussian mixture clusters\").set_y(0.95)\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final remarks\n\nIn high-dimensional spaces, Euclidean distances tend to become inflated\n(not shown in this example). Running a dimensionality reduction algorithm\nprior to k-means clustering can alleviate this problem and speed up the\ncomputations (see the example\n`sphx_glr_auto_examples_text_plot_document_clustering.py`).\n\nIn the case where clusters are known to be isotropic, have similar variance\nand are not too sparse, the k-means algorithm is quite effective and is one of\nthe fastest clustering algorithms available. This advantage is lost if one has\nto restart it several times to avoid convergence to a local minimum.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }