{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Selecting the number of clusters with silhouette analysis on KMeans clustering\n\nSilhouette analysis can be used to study the separation distance between the\nresulting clusters. The silhouette plot displays a measure of how close each\npoint in one cluster is to points in the neighboring clusters and thus provides\na way to assess parameters like number of clusters visually. This measure has a\nrange of [-1, 1].\n\nSilhouette coefficients (as these values are referred to as) near +1 indicate\nthat the sample is far away from the neighboring clusters. A value of 0\nindicates that the sample is on or very close to the decision boundary between\ntwo neighboring clusters and negative values indicate that those samples might\nhave been assigned to the wrong cluster.\n\nIn this example the silhouette analysis is used to choose an optimal value for\n``n_clusters``. The silhouette plot shows that the ``n_clusters`` value of 3, 5\nand 6 are a bad pick for the given data due to the presence of clusters with\nbelow average silhouette scores and also due to wide fluctuations in the size\nof the silhouette plots. Silhouette analysis is more ambivalent in deciding\nbetween 2 and 4.\n\nAlso from the thickness of the silhouette plot the cluster size can be\nvisualized. The silhouette plot for cluster 0 when ``n_clusters`` is equal to\n2, is bigger in size owing to the grouping of the 3 sub clusters into one big\ncluster. However when the ``n_clusters`` is equal to 4, all the plots are more\nor less of similar thickness and hence are of similar sizes as can be also\nverified from the labelled scatter plot on the right.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.cm as cm\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.cluster import KMeans\nfrom sklearn.datasets import make_blobs\nfrom sklearn.metrics import silhouette_samples, silhouette_score\n\n# Generating the sample data from make_blobs\n# This particular setting has one distinct cluster and 3 clusters placed close\n# together.\nX, y = make_blobs(\n n_samples=500,\n n_features=2,\n centers=4,\n cluster_std=1,\n center_box=(-10.0, 10.0),\n shuffle=True,\n random_state=1,\n) # For reproducibility\n\nrange_n_clusters = [2, 3, 4, 5, 6]\n\nfor n_clusters in range_n_clusters:\n # Create a subplot with 1 row and 2 columns\n fig, (ax1, ax2) = plt.subplots(1, 2)\n fig.set_size_inches(18, 7)\n\n # The 1st subplot is the silhouette plot\n # The silhouette coefficient can range from -1, 1 but in this example all\n # lie within [-0.1, 1]\n ax1.set_xlim([-0.1, 1])\n # The (n_clusters+1)*10 is for inserting blank space between silhouette\n # plots of individual clusters, to demarcate them clearly.\n ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])\n\n # Initialize the clusterer with n_clusters value and a random generator\n # seed of 10 for reproducibility.\n clusterer = KMeans(n_clusters=n_clusters, random_state=10)\n cluster_labels = clusterer.fit_predict(X)\n\n # The silhouette_score gives the average value for all the samples.\n # This gives a perspective into the density and separation of the formed\n # clusters\n silhouette_avg = silhouette_score(X, cluster_labels)\n print(\n \"For n_clusters =\",\n n_clusters,\n \"The average silhouette_score is :\",\n silhouette_avg,\n )\n\n # Compute the silhouette scores for each sample\n sample_silhouette_values = silhouette_samples(X, cluster_labels)\n\n y_lower = 10\n for i in range(n_clusters):\n # Aggregate the silhouette scores for samples belonging to\n # cluster i, and sort them\n ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]\n\n ith_cluster_silhouette_values.sort()\n\n size_cluster_i = ith_cluster_silhouette_values.shape[0]\n y_upper = y_lower + size_cluster_i\n\n color = cm.nipy_spectral(float(i) / n_clusters)\n ax1.fill_betweenx(\n np.arange(y_lower, y_upper),\n 0,\n ith_cluster_silhouette_values,\n facecolor=color,\n edgecolor=color,\n alpha=0.7,\n )\n\n # Label the silhouette plots with their cluster numbers at the middle\n ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))\n\n # Compute the new y_lower for next plot\n y_lower = y_upper + 10 # 10 for the 0 samples\n\n ax1.set_title(\"The silhouette plot for the various clusters.\")\n ax1.set_xlabel(\"The silhouette coefficient values\")\n ax1.set_ylabel(\"Cluster label\")\n\n # The vertical line for average silhouette score of all the values\n ax1.axvline(x=silhouette_avg, color=\"red\", linestyle=\"--\")\n\n ax1.set_yticks([]) # Clear the yaxis labels / ticks\n ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])\n\n # 2nd Plot showing the actual clusters formed\n colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)\n ax2.scatter(\n X[:, 0], X[:, 1], marker=\".\", s=30, lw=0, alpha=0.7, c=colors, edgecolor=\"k\"\n )\n\n # Labeling the clusters\n centers = clusterer.cluster_centers_\n # Draw white circles at cluster centers\n ax2.scatter(\n centers[:, 0],\n centers[:, 1],\n marker=\"o\",\n c=\"white\",\n alpha=1,\n s=200,\n edgecolor=\"k\",\n )\n\n for i, c in enumerate(centers):\n ax2.scatter(c[0], c[1], marker=\"$%d$\" % i, alpha=1, s=50, edgecolor=\"k\")\n\n ax2.set_title(\"The visualization of the clustered data.\")\n ax2.set_xlabel(\"Feature space for the 1st feature\")\n ax2.set_ylabel(\"Feature space for the 2nd feature\")\n\n plt.suptitle(\n \"Silhouette analysis for KMeans clustering on sample data with n_clusters = %d\"\n % n_clusters,\n fontsize=14,\n fontweight=\"bold\",\n )\n\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }