{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Demo of DBSCAN clustering algorithm\n\nDBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core\nsamples in regions of high density and expands clusters from them. This\nalgorithm is good for data which contains clusters of similar density.\n\nSee the `sphx_glr_auto_examples_cluster_plot_cluster_comparison.py` example\nfor a demo of different clustering algorithms on 2D datasets.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data generation\n\nWe use :class:`~sklearn.datasets.make_blobs` to create 3 synthetic clusters.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\nfrom sklearn.preprocessing import StandardScaler\n\ncenters = [[1, 1], [-1, -1], [1, -1]]\nX, labels_true = make_blobs(\n n_samples=750, centers=centers, cluster_std=0.4, random_state=0\n)\n\nX = StandardScaler().fit_transform(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visualize the resulting data:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nplt.scatter(X[:, 0], X[:, 1])\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute DBSCAN\n\nOne can access the labels assigned by :class:`~sklearn.cluster.DBSCAN` using\nthe `labels_` attribute. Noisy samples are given the label math:`-1`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn import metrics\nfrom sklearn.cluster import DBSCAN\n\ndb = DBSCAN(eps=0.3, min_samples=10).fit(X)\nlabels = db.labels_\n\n# Number of clusters in labels, ignoring noise if present.\nn_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)\nn_noise_ = list(labels).count(-1)\n\nprint(\"Estimated number of clusters: %d\" % n_clusters_)\nprint(\"Estimated number of noise points: %d\" % n_noise_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clustering algorithms are fundamentally unsupervised learning methods.\nHowever, since :class:`~sklearn.datasets.make_blobs` gives access to the true\nlabels of the synthetic clusters, it is possible to use evaluation metrics\nthat leverage this \"supervised\" ground truth information to quantify the\nquality of the resulting clusters. Examples of such metrics are the\nhomogeneity, completeness, V-measure, Rand-Index, Adjusted Rand-Index and\nAdjusted Mutual Information (AMI).\n\nIf the ground truth labels are not known, evaluation can only be performed\nusing the model results itself. 
In that case, the Silhouette Coefficient comes\nin handy.\n\nFor more information, see the\n`sphx_glr_auto_examples_cluster_plot_adjusted_for_chance_measures.py`\nexample or the `clustering_evaluation` module.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(f\"Homogeneity: {metrics.homogeneity_score(labels_true, labels):.3f}\")\nprint(f\"Completeness: {metrics.completeness_score(labels_true, labels):.3f}\")\nprint(f\"V-measure: {metrics.v_measure_score(labels_true, labels):.3f}\")\nprint(f\"Adjusted Rand Index: {metrics.adjusted_rand_score(labels_true, labels):.3f}\")\nprint(\n \"Adjusted Mutual Information:\"\n f\" {metrics.adjusted_mutual_info_score(labels_true, labels):.3f}\"\n)\nprint(f\"Silhouette Coefficient: {metrics.silhouette_score(X, labels):.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot results\n\nCore samples (large dots) and non-core samples (small dots) are color-coded\naccording to the assigned cluster. Samples tagged as noise are represented in\nblack.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "unique_labels = set(labels)\ncore_samples_mask = np.zeros_like(labels, dtype=bool)\ncore_samples_mask[db.core_sample_indices_] = True\n\ncolors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]\nfor k, col in zip(unique_labels, colors):\n if k == -1:\n # Black used for noise.\n col = [0, 0, 0, 1]\n\n class_member_mask = labels == k\n\n xy = X[class_member_mask & core_samples_mask]\n plt.plot(\n xy[:, 0],\n xy[:, 1],\n \"o\",\n markerfacecolor=tuple(col),\n markeredgecolor=\"k\",\n markersize=14,\n )\n\n xy = X[class_member_mask & ~core_samples_mask]\n plt.plot(\n xy[:, 0],\n xy[:, 1],\n \"o\",\n markerfacecolor=tuple(col),\n markeredgecolor=\"k\",\n markersize=6,\n )\n\nplt.title(f\"Estimated number of clusters: {n_clusters_}\")\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }