{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# A demo of K-Means clustering on the handwritten digits data\n\nIn this example we compare the various initialization strategies for K-means in\nterms of runtime and quality of the results.\n\nAs the ground truth is known here, we also apply different cluster quality\nmetrics to judge the goodness of fit of the cluster labels to the ground truth.\n\nCluster quality metrics evaluated (see `clustering_evaluation` for\ndefinitions and discussions of the metrics):\n\n=========== ========================================================\nShorthand full name\n=========== ========================================================\nhomo homogeneity score\ncompl completeness score\nv-meas V measure\nARI adjusted Rand index\nAMI adjusted mutual information\nsilhouette silhouette coefficient\n=========== ========================================================\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the dataset\n\nWe will start by loading the `digits` dataset. This dataset contains\nhandwritten digits from 0 to 9. In the context of clustering, one would like\nto group images such that the handwritten digits on the image are the same.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.datasets import load_digits\n\ndata, labels = load_digits(return_X_y=True)\n(n_samples, n_features), n_digits = data.shape, np.unique(labels).size\n\nprint(f\"# digits: {n_digits}; # samples: {n_samples}; # features {n_features}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define our evaluation benchmark\n\nWe will first our evaluation benchmark. During this benchmark, we intend to\ncompare different initialization methods for KMeans. Our benchmark will:\n\n* create a pipeline which will scale the data using a\n :class:`~sklearn.preprocessing.StandardScaler`;\n* train and time the pipeline fitting;\n* measure the performance of the clustering obtained via different metrics.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from time import time\n\nfrom sklearn import metrics\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\n\ndef bench_k_means(kmeans, name, data, labels):\n \"\"\"Benchmark to evaluate the KMeans initialization methods.\n\n Parameters\n ----------\n kmeans : KMeans instance\n A :class:`~sklearn.cluster.KMeans` instance with the initialization\n already set.\n name : str\n Name given to the strategy. 
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from time import time\n\nfrom sklearn import metrics\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\n\ndef bench_k_means(kmeans, name, data, labels):\n    \"\"\"Benchmark to evaluate the KMeans initialization methods.\n\n    Parameters\n    ----------\n    kmeans : KMeans instance\n        A :class:`~sklearn.cluster.KMeans` instance with the initialization\n        already set.\n    name : str\n        Name given to the strategy. It will be used to show the results in a\n        table.\n    data : ndarray of shape (n_samples, n_features)\n        The data to cluster.\n    labels : ndarray of shape (n_samples,)\n        The ground-truth labels used to compute the clustering metrics, which\n        require some supervision.\n    \"\"\"\n    t0 = time()\n    estimator = make_pipeline(StandardScaler(), kmeans).fit(data)\n    fit_time = time() - t0\n    results = [name, fit_time, estimator[-1].inertia_]\n\n    # Define the metrics which require only the true labels and estimator\n    # labels\n    clustering_metrics = [\n        metrics.homogeneity_score,\n        metrics.completeness_score,\n        metrics.v_measure_score,\n        metrics.adjusted_rand_score,\n        metrics.adjusted_mutual_info_score,\n    ]\n    results += [m(labels, estimator[-1].labels_) for m in clustering_metrics]\n\n    # The silhouette score requires the full dataset\n    results += [\n        metrics.silhouette_score(\n            data,\n            estimator[-1].labels_,\n            metric=\"euclidean\",\n            sample_size=300,\n        )\n    ]\n\n    # Show the results\n    formatter_result = (\n        \"{:9s}\\t{:.3f}s\\t{:.0f}\\t{:.3f}\\t{:.3f}\\t{:.3f}\\t{:.3f}\\t{:.3f}\\t{:.3f}\"\n    )\n    print(formatter_result.format(*results))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Run the benchmark\n\nWe will compare three approaches:\n\n* an initialization using `k-means++`. This method is stochastic and we will\n  run the initialization 4 times;\n* a random initialization. This method is stochastic as well and we will run\n  the initialization 4 times;\n* an initialization based on a :class:`~sklearn.decomposition.PCA`\n  projection. Indeed, we will use the components of the\n  :class:`~sklearn.decomposition.PCA` to initialize KMeans. This method is\n  deterministic and a single initialization suffices.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cluster import KMeans\nfrom sklearn.decomposition import PCA\n\nprint(82 * \"_\")\nprint(\"init\\t\\ttime\\tinertia\\thomo\\tcompl\\tv-meas\\tARI\\tAMI\\tsilhouette\")\n\nkmeans = KMeans(init=\"k-means++\", n_clusters=n_digits, n_init=4, random_state=0)\nbench_k_means(kmeans=kmeans, name=\"k-means++\", data=data, labels=labels)\n\nkmeans = KMeans(init=\"random\", n_clusters=n_digits, n_init=4, random_state=0)\nbench_k_means(kmeans=kmeans, name=\"random\", data=data, labels=labels)\n\npca = PCA(n_components=n_digits).fit(data)\nkmeans = KMeans(init=pca.components_, n_clusters=n_digits, n_init=1)\nbench_k_means(kmeans=kmeans, name=\"PCA-based\", data=data, labels=labels)\n\nprint(82 * \"_\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize the results on PCA-reduced data\n\n:class:`~sklearn.decomposition.PCA` allows us to project the data from the\noriginal 64-dimensional space into a lower-dimensional space. Here, we use it\nto project the data into a 2-dimensional space and plot both the data and the\nclusters in this new space.\n\n" ] },
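{ "cell_type": "markdown", "metadata": {}, "source": [ "The plot below colors every point of a dense mesh with the cluster it would be\nassigned to, which requires turning grid coordinates into the `(n_points, 2)`\narray expected by `kmeans.predict`. As a minimal sketch of that mesh\nconstruction (using a hypothetical 2x2 grid rather than the real mesh below):\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A hypothetical 2x2 grid illustrating the mesh construction used below:\n# meshgrid builds one coordinate matrix per axis, ravel flattens them, and\n# np.c_ stacks them into one (x, y) row per grid point.\ngx, gy = np.meshgrid(np.array([0.0, 1.0]), np.array([0.0, 1.0]))\ngrid_points = np.c_[gx.ravel(), gy.ravel()]\nprint(grid_points)  # shape (4, 2): one (x, y) row per mesh point" ] },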
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nreduced_data = PCA(n_components=2).fit_transform(data)\nkmeans = KMeans(init=\"k-means++\", n_clusters=n_digits, n_init=4)\nkmeans.fit(reduced_data)\n\n# Step size of the mesh. Decrease to increase the quality of the VQ.\nh = 0.02  # step size in the mesh [x_min, x_max] x [y_min, y_max].\n\n# Plot the decision boundary. For that, we will assign a color to each\n# point in the mesh.\nx_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1\ny_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1\nxx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n\n# Obtain labels for each point in the mesh. Use the last trained model.\nZ = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])\n\n# Put the result into a color plot\nZ = Z.reshape(xx.shape)\nplt.figure(1)\nplt.clf()\nplt.imshow(\n    Z,\n    interpolation=\"nearest\",\n    extent=(xx.min(), xx.max(), yy.min(), yy.max()),\n    cmap=plt.cm.Paired,\n    aspect=\"auto\",\n    origin=\"lower\",\n)\n\nplt.plot(reduced_data[:, 0], reduced_data[:, 1], \"k.\", markersize=2)\n# Plot the centroids as a white X\ncentroids = kmeans.cluster_centers_\nplt.scatter(\n    centroids[:, 0],\n    centroids[:, 1],\n    marker=\"x\",\n    s=169,\n    linewidths=3,\n    color=\"w\",\n    zorder=10,\n)\nplt.title(\n    \"K-means clustering on the digits dataset (PCA-reduced data)\\n\"\n    \"Centroids are marked with a white cross\"\n)\nplt.xlim(x_min, x_max)\nplt.ylim(y_min, y_max)\nplt.xticks(())\nplt.yticks(())\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }