{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Agglomerative clustering with different metrics\n\nDemonstrates the effect of different metrics on the hierarchical clustering.\n\nThe example is engineered to show the effect of the choice of different\nmetrics. It is applied to waveforms, which can be seen as\nhigh-dimensional vector. Indeed, the difference between metrics is\nusually more pronounced in high dimension (in particular for euclidean\nand cityblock).\n\nWe generate data from three groups of waveforms. Two of the waveforms\n(waveform 1 and waveform 2) are proportional one to the other. The cosine\ndistance is invariant to a scaling of the data, as a result, it cannot\ndistinguish these two waveforms. Thus even with no noise, clustering\nusing this distance will not separate out waveform 1 and 2.\n\nWe add observation noise to these waveforms. We generate very sparse\nnoise: only 6% of the time points contain noise. As a result, the\nl1 norm of this noise (ie \"cityblock\" distance) is much smaller than it's\nl2 norm (\"euclidean\" distance). This can be seen on the inter-class\ndistance matrices: the values on the diagonal, that characterize the\nspread of the class, are much bigger for the Euclidean distance than for\nthe cityblock distance.\n\nWhen we apply clustering to the data, we find that the clustering\nreflects what was in the distance matrices. Indeed, for the Euclidean\ndistance, the classes are ill-separated because of the noise, and thus\nthe clustering does not separate the waveforms. For the cityblock\ndistance, the separation is good and the waveform classes are recovered.\nFinally, the cosine distance does not separate at all waveform 1 and 2,\nthus the clustering puts them in the same cluster.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.patheffects as PathEffects\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.cluster import AgglomerativeClustering\nfrom sklearn.metrics import pairwise_distances\n\nnp.random.seed(0)\n\n# Generate waveform data\nn_features = 2000\nt = np.pi * np.linspace(0, 1, n_features)\n\n\ndef sqr(x):\n return np.sign(np.cos(x))\n\n\nX = list()\ny = list()\nfor i, (phi, a) in enumerate([(0.5, 0.15), (0.5, 0.6), (0.3, 0.2)]):\n for _ in range(30):\n phase_noise = 0.01 * np.random.normal()\n amplitude_noise = 0.04 * np.random.normal()\n additional_noise = 1 - 2 * np.random.rand(n_features)\n # Make the noise sparse\n additional_noise[np.abs(additional_noise) < 0.997] = 0\n\n X.append(\n 12\n * (\n (a + amplitude_noise) * (sqr(6 * (t + phi + phase_noise)))\n + additional_noise\n )\n )\n y.append(i)\n\nX = np.array(X)\ny = np.array(y)\n\nn_clusters = 3\n\nlabels = (\"Waveform 1\", \"Waveform 2\", \"Waveform 3\")\n\ncolors = [\"#f7bd01\", \"#377eb8\", \"#f781bf\"]\n\n# Plot the ground-truth labelling\nplt.figure()\nplt.axes([0, 0, 1, 1])\nfor l, color, n in zip(range(n_clusters), colors, labels):\n lines = plt.plot(X[y == l].T, c=color, alpha=0.5)\n lines[0].set_label(n)\n\nplt.legend(loc=\"best\")\n\nplt.axis(\"tight\")\nplt.axis(\"off\")\nplt.suptitle(\"Ground truth\", size=20, y=1)\n\n\n# Plot the distances\nfor index, metric in enumerate([\"cosine\", \"euclidean\", \"cityblock\"]):\n avg_dist = np.zeros((n_clusters, n_clusters))\n plt.figure(figsize=(5, 4.5))\n for i in range(n_clusters):\n for j in range(n_clusters):\n 
avg_dist[i, j] = pairwise_distances(\n                X[y == i], X[y == j], metric=metric\n            ).mean()\n    avg_dist /= avg_dist.max()\n    for i in range(n_clusters):\n        for j in range(n_clusters):\n            txt = plt.text(\n                i,\n                j,\n                \"%5.3f\" % avg_dist[i, j],\n                verticalalignment=\"center\",\n                horizontalalignment=\"center\",\n            )\n            txt.set_path_effects(\n                [PathEffects.withStroke(linewidth=5, foreground=\"w\", alpha=0.5)]\n            )\n\n    plt.imshow(avg_dist, interpolation=\"nearest\", cmap=\"cividis\", vmin=0)\n    plt.xticks(range(n_clusters), labels, rotation=45)\n    plt.yticks(range(n_clusters), labels)\n    plt.colorbar()\n    plt.suptitle(\"Interclass %s distances\" % metric, size=18, y=1)\n    plt.tight_layout()\n\n\n# Plot clustering results\nfor metric in [\"cosine\", \"euclidean\", \"cityblock\"]:\n    model = AgglomerativeClustering(\n        n_clusters=n_clusters, linkage=\"average\", metric=metric\n    )\n    model.fit(X)\n    plt.figure()\n    plt.axes([0, 0, 1, 1])\n    for label, color in zip(range(n_clusters), colors):\n        plt.plot(X[model.labels_ == label].T, c=color, alpha=0.5)\n    plt.axis(\"tight\")\n    plt.axis(\"off\")\n    plt.suptitle(\"AgglomerativeClustering(metric=%s)\" % metric, size=20, y=1)\n\n\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }