{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intro to Data Science\n", "## Part VI. - Model Evaluation, Hyperparameter optimization, Clustering\n", "\n", "### Table of contents\n", "\n", "- #### Model evaluation\n", " - Theory\n", " - Classification Metrics\n", " - Accuracy\n", " - Confusion matrix\n", " - Precision, Recall, F1 score\n", " - ROC curve\n", " - Regression Metrics\n", " - Explained variance\n", " - MAE\n", " - MSE\n", " - Clustering Metrics\n", " \n", "- #### Hyperparameter optimization\n", " - Theory\n", " - Cross Validation\n", " - Grid Search Cross Validation\n", " - Randomized Search Cross Validation\n", " - Other Hyperparameter searching methods\n", " \n", "- #### Clustering\n", " - Theory\n", " - Clustering methods\n", " - K-means\n", " - DBSCAN\n", " - Hierarchical clustering\n", " - Spectral clustering\n", " - Gaussian Mixture Models\n", " - Cluster Validation\n", " \n", "---\n", "\n", "# I. Model Evaluation\n", "\n", "## What is Model Evaluation?\n", "\n", "When working with machine learning algorithms, data mining techniques, or statistical models, it is essential to assess whether a model has been trained effectively. Depending on the task, various metrics are available to measure the performance of a fitted model. Beyond raw metrics, there are key concepts for comparing models. Some techniques help identify overfitting, while others determine whether one model is simpler or more generalizable than another.\n", "\n", "## Why is Model Evaluation Important?\n", "\n", "To find the optimal solution for a given problem, it is crucial to evaluate a model’s performance. 
Proper evaluation helps decide whether to continue training, adjust the preprocessing pipeline, or explore alternative models.\n", "\n", "## Tools for Model Evaluation\n", "\n", "- **Classification metrics** \n", " - Accuracy \n", " - Precision \n", " - Recall \n", " - Precision-Recall Curve \n", " - F1 Score \n", " - Confusion Matrix \n", " - ROC Curve \n", "\n", "- **Regression metrics** \n", " - Mean Absolute Error (MAE) \n", " - Root Mean Squared Error (RMSE) \n", " - Explained Variance Score \n", "\n", "- **Clustering metrics** \n", "\n", "- **Cross-Validation** \n", "\n", "- **Other techniques** (e.g., model comparison methods) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.datasets import load_digits\n", "from sklearn.model_selection import train_test_split\n", "\n", "from sklearn.pipeline import Pipeline\n", "\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "np.random.seed(42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_dig, y_dig = load_digits(return_X_y=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X_dig, y_dig,\n", " test_size=.25,\n", " random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nn_pipe = Pipeline([('nn', MLPClassifier(hidden_layer_sizes=(5,), random_state=42))])\n", "nn_pipe.fit(X_train, y_train)\n", "y_hat = nn_pipe.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Classification Metrics\n", "\n", "### a) [Accuracy](http://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score)\n", "\n", "Accuracy measures the proportion of correctly classified samples and is defined as:\n", "\n", "$$\\text{accuracy}(y, \\hat{y}) = \\frac{1}{n_\\text{samples}} \\sum_{i=0}^{n_\\text{samples}-1} 1(\\hat{y}_i = y_i)$$\n", "\n", "where $1(x)$ is the indicator function that returns 1 if the condition is true and 0 otherwise.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accuracy_score(y_test, y_hat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b) Confusion Matrix\n", "\n", "A [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix#Table_of_confusion) provides a summary of prediction results by comparing the model’s predicted labels with the actual labels.\n", "\n", "![Type I and II errors](./pics/confusion_matrix_explained.png) \n", "via [@jimgthornton](https://twitter.com/jimgthornton)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "confusion_matrix(y_test, y_hat)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(confusion_matrix(y_test, y_hat))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c) Precision, Recall, and F1 Score\n", "\n", "*\"Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples.\"* \n", "— [scikit-learn 
documentation](http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures)\n", "\n", "Precision, recall, and F1 score rely on four key values from the confusion matrix: \n", "- **True Positives (TP)** \n", "- **True Negatives (TN)** \n", "- **False Positives (FP)** \n", "- **False Negatives (FN)** \n", "\n", "These can be visualized in the following confusion matrix:\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Confusion Matrix | Actual: Positive | Actual: Negative |\n",
"|:---|:---:|:---:|\n",
"| **Predicted: Positive** | TP | FP |\n",
"| **Predicted: Negative** | FN | TN |
\n", "\n", "- **[Precision](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score)**: \n", " The fraction of correctly predicted positive instances among all predicted positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative. \n", " $$\\text{Precision} = \\frac{\\text{TP}}{\\text{TP} + \\text{FP}}$$ \n", "\n", "- **[Recall](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score)**: \n", " The fraction of correctly predicted positive instances among all actual positives. The recall is intuitively the ability of the classifier to find all the positive samples. \n", " $$\\text{Recall} = \\frac{\\text{TP}}{\\text{TP} + \\text{FN}}$$ \n", "\n", "- **[F1 Score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)**: \n", " The harmonic mean of precision and recall, balancing both metrics. \n", " $$F1 = 2 \\cdot \\frac{\\text{Precision} \\cdot \\text{Recall}}{\\text{Precision} + \\text{Recall}}$$ \n", "\n", "These metrics are primarily designed for binary classification problems. However, for multi-class and multi-label problems, different averaging strategies exist: \n", "- **Macro averaging** (`average='macro'`): Computes the unweighted mean of the metric across all classes. \n", "- **Micro averaging** (`average='micro'`): Aggregates TP, FP, FN across all classes before computing the metric. 
\n", "\n", "For a detailed discussion of multi-class strategies, refer to [scikit-learn documentation](http://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import precision_score\n", "precision_score(y_test, y_hat, average=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import recall_score\n", "recall_score(y_test, y_hat, average=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import f1_score\n", "f1_score(y_test, y_hat, average=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### d) [ROC Curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html)\n", "\n", "*A Receiver Operating Characteristic (ROC) curve is a graphical representation of a binary classifier’s performance across different threshold settings. It plots the true positive rate (TPR) against the false positive rate (FPR).*\n", "\n", "Following the definitions on [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):\n", "\n", "- **True Positive Rate (TPR) / Sensitivity**: \n", " $$\\text{TPR} = \\frac{\\text{TP}}{\\text{TP} + \\text{FN}}$$ \n", "- **False Positive Rate (FPR)**: \n", " $$\\text{FPR} = \\frac{\\text{FP}}{\\text{FP} + \\text{TN}}$$ \n", "\n", "The most important metric extracted from the ROC curve is the **Area Under the Curve (AUC)**, which measures the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance. 
\n", "\n", "According to [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve): \n", "*\"AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.\"* \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_curve\n", "from sklearn.metrics import auc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since ROC analysis is only applicable to binary classification, multi-class problems require transformation into a binary format:\n", "\n", "1. Generate prediction probabilities for each class. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_score = nn_pipe.predict_proba(X_test)\n", "classes = np.unique(y_dig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Convert multi-class labels into binary form (1 for the current class, 0 otherwise). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def onevsrest(array, label):\n", " return (array == label).astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Compute **TPR**, **FPR**, and **AUC** for each class using `roc_curve`.\n", "> `roc_curve` returns the __FPR__, __TPR__ and the __threshold__ arrays" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fpr, tpr, thres, roc_auc = {}, {}, {}, {}\n", "for i in classes:\n", " fpr[i], tpr[i], thres[i] = roc_curve(onevsrest(y_test, i), y_score[:, i])\n", " roc_auc[i] = auc(fpr[i], tpr[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Plot all class-wise ROC curves in a single figure." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mycolors = sns.color_palette('muted', n_colors=len(classes))\n", "fig, ax = plt.subplots(figsize=(8, 8))\n", "\n", "ax.set_xlim([0.0, 1.0])\n", "ax.set_ylim([0.0, 1.05])\n", "ax.set_xlabel('False Positive Rate')\n", "ax.set_ylabel('True Positive Rate')\n", "\n", "# plot ROC for random baseline classifier\n", "ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')\n", "\n", "# plot ROC for each class\n", "for cls in classes:\n", " label = (f'ROC curve for {cls} (area = {roc_auc[cls]:0.2f})')\n", "\n", " ax.plot(fpr[cls], tpr[cls], color=mycolors[cls], lw=2, label=label)\n", " ax.legend(loc=\"lower right\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Regression metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg_pipe = Pipeline([('reg', LogisticRegression(solver='liblinear', random_state=42))])\n", "reg_pipe.fit(X_train, y_train)\n", "y_hat = reg_pipe.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### a) [Explained Variance Score](http://scikit-learn.org/stable/modules/model_evaluation.html#explained-variance-score)\n", "\n", "Explained variance measures how well the model accounts for the variability in the target variable. It is defined as:\n", "\n", "$$\\text{Explained Variance}(y, \\hat{y}) = 1 - \\frac{\\text{Var}(y - \\hat{y})}{\\text{Var}(y)}$$\n", "\n", "where:\n", "- $\\hat{y}$ is the predicted target output,\n", "- $y$ is the actual target output,\n", "- $\\text{Var}$ represents variance (the square of standard deviation). \n", "\n", "A score close to 1 indicates a well-fitting model, while a score close to 0 suggests poor performance." 
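, "\n",
"As a quick check, the formula can be applied directly with NumPy; a minimal sketch on arbitrary toy values (not the digits pipeline above):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import explained_variance_score\n",
"\n",
"y_true = np.array([3.0, -0.5, 2.0, 7.0])   # arbitrary toy targets\n",
"y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # arbitrary toy predictions\n",
"\n",
"# 1 - Var(y - y_hat) / Var(y), term by term from the formula above\n",
"manual = 1 - np.var(y_true - y_pred) / np.var(y_true)\n",
"assert np.isclose(manual, explained_variance_score(y_true, y_pred))\n",
"```"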
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import explained_variance_score\n", "explained_variance_score(y_test, y_hat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### b) [Mean Absolute Error (MAE)](http://scikit-learn.org/stable/modules/model_evaluation.html#mean-absolute-error)\n", "\n", "MAE is the average absolute difference between actual and predicted values. It is a risk metric corresponding to the expected value of the absolute error loss (also known as L1 loss):\n", "\n", "$$\\text{MAE}(y, \\hat{y}) = \\frac{1}{n_{\\text{samples}}} \\sum_{i=0}^{n_{\\text{samples}}-1} \\left| y_i - \\hat{y}_i \\right|.$$\n", "\n", "- MAE treats all errors equally, making it an intuitive measure of model accuracy.\n", "- Unlike squared error metrics, it does not penalize large errors disproportionately." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_absolute_error\n", "mean_absolute_error(y_test, y_hat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c) [Mean Squared Error (MSE)](http://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error)\n", "\n", "MSE measures the average squared difference between actual and predicted values:\n", "\n", "$$\\text{MSE}(y, \\hat{y}) = \\frac{1}{n_{\\text{samples}}} \\sum_{i=0}^{n_{\\text{samples}} - 1} (y_i - \\hat{y}_i)^2.$$\n", "\n", "- MSE gives more weight to large errors, making it sensitive to outliers.\n", "- It is widely used due to its mathematical properties and ease of optimization.\n", "\n", "A widely used variant, the **Root Mean Squared Error (RMSE)**, is computed as:\n", "\n", "$$\\text{RMSE}(y, \\hat{y}) = \\sqrt{\\text{MSE}(y, \\hat{y})}.$$\n", "\n", "- RMSE is in the same units as the target variable, making interpretation easier.\n", "- It is more sensitive to large errors compared to MAE." 
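, "\n",
"All three error measures follow directly from their formulas; a minimal sketch on the same arbitrary toy values as above:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error\n",
"\n",
"y_true = np.array([3.0, -0.5, 2.0, 7.0])   # arbitrary toy targets\n",
"y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # arbitrary toy predictions\n",
"\n",
"mae = np.mean(np.abs(y_true - y_pred))     # L1 loss\n",
"mse = np.mean((y_true - y_pred) ** 2)      # squared loss\n",
"rmse = np.sqrt(mse)                        # back in the target's units\n",
"\n",
"assert np.isclose(mae, mean_absolute_error(y_true, y_pred))\n",
"assert np.isclose(mse, mean_squared_error(y_true, y_pred))\n",
"```"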
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error\n", "mean_squared_error(y_test, y_hat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# II. Hyperparameter Optimization\n", "\n", "## What is Hyperparameter Optimization?\n", "\n", "According to [Wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_optimization):\n", "\n", "> _\"In the context of machine learning, **hyperparameter optimization** or **model selection** is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent dataset. Often, cross-validation is used to estimate this generalization performance._ \n", "\n", "> _Hyperparameter optimization contrasts with actual learning problems, which are also often cast as optimization problems but optimize a loss function on the training set alone. In effect, learning algorithms learn parameters that model/reconstruct their inputs well, while hyperparameter optimization ensures the model does not overfit its data by tuning, e.g., regularization.\"_ \n", "\n", "## Why is it Important?\n", "\n", "To find the optimal solution for a given problem, multiple models with similar predictive or exploratory power need to be trained, and the simplest effective model should be selected. This process includes choosing models and tuning hyperparameters, which can be time-consuming and tedious when done manually. 
\n", "\n", "Automated hyperparameter optimization methods help overcome this challenge by saving time and improving results.\n", "\n", "## Common Hyperparameter Optimization Techniques\n", "\n", "- **Grid Search**\n", "- **Randomized Search**\n", "- **Bayesian Optimization**\n", "- **Gradient-based Optimization**\n", "- **TPOT (Tree-based Pipeline Optimization Tool)**\n", "- **Other metaheuristic approaches**\n", "\n", "---\n", "\n", "## [Cross-Validation](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)\n", "\n", "To select the best model, we first need a reliable way to measure its accuracy. \n", "\n", "### Choosing a Validation Metric\n", "\n", "The choice of validation metric depends on the type of task:\n", "- For **classification**, the default metric in Scikit-learn is **accuracy**.\n", "- For **regression**, the default metric is **$R^2$ (coefficient of determination)**.\n", "- Other metrics can be selected from [this list](http://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules).\n", "\n", "### Why Cross-Validation?\n", "\n", "> _\"Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that simply memorizes the labels of the training samples would achieve a perfect score but fail to generalize to unseen data. This issue is called **overfitting**.\"_ \n", "> — [Scikit-learn User Guide](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance)\n", "\n", "To prevent overfitting, we split the dataset into two parts: \n", "- **Training set** – Used to train the model.\n", "- **Test set** – Used to evaluate the model’s performance.\n", "\n", "However, a single train-test split can lead to **high variance in model performance**. 
To obtain a more reliable estimate of a model’s accuracy, we use **Cross-Validation (CV)**, where the dataset is split multiple times using different strategies. \n", "\n", "More details on different cross-validation strategies can be found [here](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators).\n", "\n", "### Hyperparameter Tuning with Cross-Validation\n", "\n", "A model's performance can vary significantly depending on its hyperparameters. To find the best settings, multiple models must be trained with different hyperparameters and evaluated using cross-validation. \n", "\n", "Cross-validation provides a good estimate of a trained model’s accuracy, but additional techniques are needed to efficiently search for the optimal hyperparameters.\n", "\n", "---\n", "\n", "### 1. Grid Search Cross-Validation\n", "\n", "Grid search is a systematic approach that:\n", "- Generates a **parameter grid** from a predefined set of values.\n", "- Evaluates the model's accuracy for every combination of hyperparameters using cross-validation.\n", "- Selects the combination that yields the best performance." 
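, "\n",
"The cross-validation estimate on its own is available via `cross_val_score`; a sketch on the digits data with the same liblinear logistic regression (5 folds and the default accuracy scorer are assumed here):\n",
"\n",
"```python\n",
"from sklearn.datasets import load_digits\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"X, y = load_digits(return_X_y=True)\n",
"clf = LogisticRegression(solver='liblinear', random_state=42)\n",
"\n",
"# five accuracy estimates, one per held-out fold\n",
"scores = cross_val_score(clf, X, y, cv=5)\n",
"print(scores.mean().round(3), scores.std().round(3))\n",
"```"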
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.model_selection import GridSearchCV" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe_digit = Pipeline([\n", " ('pca', PCA(svd_solver='randomized', random_state=42)),\n", " ('logistic', LogisticRegression(solver='liblinear', random_state=42))\n", "])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param_grid = {\n", " 'pca__n_components': [20, 40, 64],\n", " 'logistic__C': np.logspace(-4, 4, 3)\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_search = GridSearchCV(estimator=pipe_digit, \n", " param_grid=param_grid,\n", " n_jobs=-1,\n", " cv=5,\n", " verbose=1, \n", " return_train_score=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "grid_search.fit(X_dig, y_dig)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_search.best_estimator_.get_params(deep=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_search.best_params_, grid_search.best_score_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_search.cv_results_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "score_dict = grid_search.cv_results_\n", "hmap = pd.DataFrame({\n", " 'mean': score_dict['mean_test_score'],\n", " 'C': [param['logistic__C'] for param in score_dict['params']],\n", " 'n': [param['pca__n_components'] for param in score_dict['params']]\n", "})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(hmap.pivot(index='C', columns='n', 
values='mean'), annot=True, fmt='.3f');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Randomized Search Cross-Validation\n", "\n", "Randomized search selects hyperparameter values randomly from predefined ranges and evaluates a fixed number of configurations using cross-validation. This approach is often more efficient than grid search, especially when the search space is large." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import RandomizedSearchCV" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random_search_digit = RandomizedSearchCV(\n", " pipe_digit,\n", " {\n", " 'pca__n_components': np.linspace(1, 64, 64, dtype=int),\n", " 'logistic__C': np.logspace(-4, 4, 30),\n", " },\n", " n_iter=30, \n", " n_jobs=-1,\n", " cv=5,\n", " verbose=1,\n", " return_train_score=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "random_search_digit.fit(X_dig, y_dig)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random_score_dict = random_search_digit.cv_results_\n", "hmap_r = pd.DataFrame({\n", " 'mean': random_score_dict['mean_test_score'],\n", " 'C': [param['logistic__C'] for param in random_score_dict['params']],\n", " 'n': [param['pca__n_components'] for param in random_score_dict['params']]\n", "})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random_search_digit.best_params_, random_search_digit.best_score_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(12,10))\n", "sns.heatmap(hmap_r.pivot(index='C', columns='n', values='mean'), annot=True, ax=ax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. 
Other Hyperparameter Optimization Methods\n", "\n", "Beyond grid and randomized search, there are several advanced methods for hyperparameter tuning:\n", "\n", "- TPOT – Uses genetic algorithms to automate machine learning model selection and hyperparameter tuning.\n", "- auto-sklearn – An automated machine learning (AutoML) library that optimizes both model selection and hyperparameters.\n", "- Hyperopt – Implements Bayesian optimization for more efficient hyperparameter searches.\n", "- Optuna – A powerful and flexible framework for defining and optimizing hyperparameters using pruning and efficient sampling.\n", "- Ray Tune – A scalable hyperparameter tuning framework supporting distributed execution and integration with multiple search algorithms.\n", "- Nevergrad – A derivative-free optimization platform developed by Facebook AI for hyperparameter tuning and black-box optimization.\n", "- Why Grid Search Isn't Always the Best Choice – A discussion on alternative strategies for hyperparameter optimization.\n", "\n", "These methods offer various trade-offs in terms of efficiency, interpretability, and computational cost, making them useful for different machine learning scenarios.\n", "\n", "---\n", "# III. Clustering\n", "\n", "## What is Clustering?\n", "Clustering is an unsupervised machine learning task used to uncover hidden patterns in data. \n", "_\"Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. 
Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution.\"_ — Wikipedia\n", "\n", "_\"Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).\"_ — Wikipedia\n", "\n", "## Why is it Important?\n", "Clustering is useful when labeled data is unavailable, making it essential for exploratory data analysis and pattern recognition. It helps in: \n", "- **Recommender systems** – Grouping similar users or items for personalized recommendations. \n", "- **Anomaly detection** – Identifying fraudulent transactions, network intrusions, or defective products. \n", "- **Medical research** – Detecting subtypes of diseases based on patient data. \n", "- **Image segmentation** – Separating objects in images based on pixel similarity. \n", "- **Social network analysis** – Identifying communities in graphs. 
\n", "\n", "## Clustering Algorithms\n", "Different clustering algorithms have varying assumptions about data structure, making them suited for different tasks:\n", "\n", "- **K-Means** – A fast, centroid-based algorithm that partitions data into K clusters.\n", "- **Affinity Propagation** – Uses message-passing to identify exemplars without a predefined number of clusters.\n", "- **Mean-Shift** – A density-based method that finds areas of high data concentration.\n", "- **Spectral Clustering** – Uses graph-based methods to identify clusters based on connectivity.\n", "- **Hierarchical Clustering (Ward, Agglomerative)** – Creates a tree of nested clusters, useful for visualizing relationships.\n", "- **DBSCAN (Density-Based Spatial Clustering)** – Identifies clusters based on density, ideal for noisy and irregular data.\n", "- **Gaussian Mixtures (GMMs)** – Models clusters as Gaussian distributions, allowing soft clustering.\n", "- **Birch (Balanced Iterative Reducing and Clustering using Hierarchies)** – Efficient for large datasets.\n", "- **HDBSCAN** – An improvement over DBSCAN, automatically selecting the optimal number of clusters.\n", "- **OPTICS (Ordering Points To Identify Clustering Structure)** – Works like DBSCAN but better for variable density.\n", "- **Support Vector Clustering** – Uses SVMs to detect cluster boundaries.\n", "\n", "Each method has its strengths and weaknesses depending on dataset characteristics such as density, shape, and noise level." 
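, "\n",
"When ground-truth labels happen to be available (as in the synthetic datasets below), cluster assignments can be scored against them; a sketch using the adjusted Rand index, one of several cluster-validation metrics (the blobs and all parameters are illustrative):\n",
"\n",
"```python\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.metrics import adjusted_rand_score\n",
"\n",
"X, y_true = make_blobs(n_samples=500, centers=3, random_state=42)\n",
"labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)\n",
"\n",
"# 1.0 means perfect agreement with the true grouping, ~0.0 means random labeling\n",
"ari = adjusted_rand_score(y_true, labels)\n",
"print(ari)\n",
"```"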
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_circles\n", "from sklearn.datasets import make_moons\n", "from sklearn.datasets import make_blobs\n", "from sklearn.datasets import load_iris" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_clusters = 3\n", "n_samples = 1500\n", "\n", "iris = load_iris(return_X_y=True)\n", "noisy_circles = make_circles(n_samples=n_samples, factor=.5, noise=.05, random_state=42)\n", "noisy_moons = make_moons(n_samples=n_samples, noise=.05, random_state=42)\n", "blobs = make_blobs(n_samples=n_samples, random_state=42)\n", "no_structure = np.random.rand(n_samples, 2), None\n", "\n", "datasets = {\n", " 'iris': iris,\n", " 'noisy_circles': noisy_circles,\n", " 'noisy_moons': noisy_moons,\n", " 'blobs': blobs,\n", " 'no_structure': no_structure\n", "}\n", "\n", "colors = np.array(sns.color_palette('muted', n_colors=10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cluster_datasets(model, preprocess=None, **params):\n", " model = model(**params)\n", " results = {}\n", " Xs = {}\n", " for problem, dataset in datasets.items():\n", " X, y = dataset\n", " if preprocess:\n", " X = preprocess.fit_transform(X, y)\n", " Xs[problem] = X\n", " model.fit(X)\n", " if hasattr(model, 'labels_'):\n", " results[problem] = model.labels_.astype('int')\n", " else:\n", " results[problem] = model.predict(X)\n", " return model, Xs, results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot(Xs, results):\n", " plot_num = 1\n", " plt.figure(figsize=(len(datasets) * 4, 4))\n", " for problem, X in Xs.items():\n", " plt.subplot(1, len(datasets), plot_num)\n", " plt.scatter(X[:, 0], X[:, 1], color=colors[results[problem]], edgecolors='k')\n", " plot_num += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
[K-Means](http://scikit-learn.org/stable/modules/clustering.html#k-means)\n", "\n", "\n", "\n", "K-Means clustering partitions $n$ objects into $k$ clusters, where each object belongs to the cluster with the nearest mean (centroid). It produces exactly $k$ distinct clusters that aim to maximize inter-cluster separation while minimizing intra-cluster variance. However, the optimal number of clusters ($k$) is not known beforehand and must be determined from the data.\n", "\n", "The objective of K-Means is to minimize total intra-cluster variance, or equivalently, the sum of squared distances between each point and its assigned cluster centroid:\n", "\n", "$$\\underset{S}{\\arg\\min} \\sum_{i=1}^{k} \\sum_{x \\in S_i} \\left\\| x - \\mu_i \\right\\|^2$$\n", "\n", "where $S = \\{S_1, \\dots, S_k\\}$ are the clusters and $\\mu_i$ is the mean of the points in $S_i$.\n", "
\n", "\n", "### **Algorithm**:\n", "1. Select $k$ initial cluster centers randomly.\n", "2. Assign each data point to the nearest cluster center (using Euclidean distance).\n", "3. Compute new centroids by taking the mean of all points in each cluster.\n", "4. Repeat steps 2 and 3 until convergence (i.e., cluster assignments no longer change or centroids stabilize).\n", "\n", "### **Choosing the Optimal $k$**\n", "Since K-Means requires specifying $k$ in advance, various techniques help determine the best number of clusters:\n", "- **Elbow Method** – Plots the sum of squared errors (SSE) for different $k$ values and looks for an \"elbow\" where diminishing returns begin.\n", "- **Silhouette Score** – Measures how well-separated clusters are based on intra- and inter-cluster distances.\n", "- **Gap Statistic** – Compares clustering performance against randomly generated reference data.\n", "\n", "### **Considerations**\n", "- **Sensitive to Initialization** – Poor centroid initialization can lead to suboptimal solutions. Methods like **K-Means++** improve this.\n", "- **Assumes Spherical Clusters** – Struggles with non-convex or varied-density clusters (e.g., concentric circles).\n", "- **Scalability** – Efficient on large datasets but may struggle with very high-dimensional data.\n", "\n", "The animation is from Wikipedia. \n", "The description is adapted from Dr. Saed Sayad's book __An Introduction to Data Mining__."
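, "\n",
"A sketch of the elbow and silhouette checks on synthetic blobs, where the true number of clusters is known to be 3 (all parameter choices are illustrative):\n",
"\n",
"```python\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.metrics import silhouette_score\n",
"\n",
"X, _ = make_blobs(n_samples=500, centers=3, random_state=42)\n",
"\n",
"inertia, silhouette = {}, {}\n",
"for k in range(2, 7):\n",
"    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)\n",
"    inertia[k] = km.inertia_                       # SSE for the elbow plot\n",
"    silhouette[k] = silhouette_score(X, km.labels_)\n",
"\n",
"# inertia always decreases with k; the silhouette score peaks near the true k\n",
"print(max(silhouette, key=silhouette.get))\n",
"```"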
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "from sklearn.cluster import KMeans, MiniBatchKMeans" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " model=KMeans,\n", " preprocess=StandardScaler(),\n", " n_init='auto',\n", " n_clusters=3,\n", " random_state=42\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " MiniBatchKMeans,\n", " preprocess=StandardScaler(),\n", " n_init='auto',\n", " n_clusters=3,\n", " random_state=42\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Cluster the digits dataset!\n", "\n", "Hints:\n", "- read with sklearn's built-in method\n", "- use standard scaling\n", "- use dimension reduction (pca)\n", "- visualize results\n", "\n", "In case you are lost, follow [sklearn's guide](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. [DBSCAN (Density-Based Spatial Clustering of Applications with Noise)](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)\n", "\n", "\n", "\n", "DBSCAN is a clustering algorithm that groups points based on **density** rather than assuming spherical cluster shapes (like K-Means). 
It can identify clusters of arbitrary shape and is robust to noise (outliers), making it suitable for real-world datasets with irregular distributions.\n", "\n", "According to scikit-learn’s [User Guide](http://scikit-learn.org/stable/modules/clustering.html#dbscan):\n", "\n", "> *DBSCAN views clusters as areas of high density separated by areas of low density. A cluster consists of __core samples__ (high-density points) and __non-core samples__ (border points close to core samples). Noise points (outliers) remain unclustered. The density is controlled by two parameters: `eps` (the neighborhood radius) and `min_samples` (the minimum number of points required to form a dense region). Higher `min_samples` or lower `eps` increase the density requirement for a cluster to form.*\n", "\n", "
\n", "\n", "### **Key Concepts**\n", "DBSCAN classifies points into three categories:\n", "\n", "- **Core Points**: A point is a core point if at least `min_samples` points (including itself) exist within a radius of `eps`. These points form the dense regions of clusters.\n", "- **Border Points**: A point that is **within `eps` of a core point** but does not have enough neighbors to be a core point itself.\n", "- **Outliers (Noise Points)**: A point that is **not reachable from any core point** (i.e., it doesn’t belong to any cluster).\n", "\n", "### **How DBSCAN Works**\n", "1. Select an arbitrary starting point.\n", "2. Identify its `eps`-neighborhood.\n", " - If the point has at least `min_samples` neighbors, it becomes a core point and forms a new cluster.\n", " - Otherwise, it is temporarily labeled as noise (though it might later become a border point).\n", "3. Expand the cluster by recursively adding reachable points.\n", "4. Repeat until all points are classified as core, border, or noise.\n", "\n", "### **Advantages of DBSCAN**\n", "- **No need to specify the number of clusters** (unlike K-Means). \n", "- **Can detect arbitrarily shaped clusters** (useful for spatial and image data). \n", "- **Robust to noise and outliers** (outliers remain unclustered). \n", "\n", "### **Limitations**\n", "- **Choosing `eps` can be tricky**: A poorly chosen `eps` may lead to over- or under-clustering.\n", "- **Not ideal for varying-density clusters**: Struggles when clusters have different densities.\n", "- **Computationally expensive**: Slower on high-dimensional datasets compared to K-Means.\n", "\n", "Animation is from ProgrammerSought. \n", "A deeper theoretical explanation can be found on [Wikipedia](https://en.wikipedia.org/wiki/DBSCAN#Preliminary)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import DBSCAN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " DBSCAN,\n", " preprocess=StandardScaler(),\n", " eps=0.3,\n", " min_samples=3\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Cluster the generated dataset with K-Means and DBSCAN!\n", "\n", "1. Data generation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X, y = make_blobs(random_state=170, n_samples=600, centers = 5)\n", "rng = np.random.RandomState(42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transformation = rng.normal(size=(2, 2))\n", "X = np.dot(X, transformation)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(X[:, 0], X[:, 1])\n", "plt.xlabel(\"Feature 0\")\n", "plt.ylabel(\"Feature 1\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Clustering with K-means" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Clustering with DBSCAN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examine and explain the clustering results!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. [Hierarchical Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering)\n", "\n", "\n", "\n", "
\n", "\n", "Hierarchical clustering is a family of clustering algorithms that build nested clusters in a hierarchical structure. The result is typically represented as a **dendrogram**, a tree-like diagram where:\n", "- The **root** represents all data points in a single cluster.\n", "- The **leaves** represent individual data points.\n", "- **Branches** show how clusters are merged or split at different levels.\n", "\n", "Unlike K-Means or DBSCAN, hierarchical clustering does **not** require pre-specifying the number of clusters (`k`). Instead, the hierarchy can be cut at different levels to obtain a varying number of clusters.\n", "\n", "### **Types of Hierarchical Clustering**\n", "1. **Agglomerative Clustering (Bottom-Up)**\n", " - Each data point starts as its own cluster.\n", " - Clusters are **iteratively merged** based on similarity until only one cluster remains.\n", " - This is the most common approach and is implemented in `sklearn.cluster.AgglomerativeClustering`.\n", "\n", "2. 
**Divisive Clustering (Top-Down)**\n", " - Starts with a **single cluster** containing all data points.\n", " - The cluster is **recursively split** into smaller clusters.\n", " - Less commonly used due to higher computational cost.\n", "\n", "### **Linkage Criteria (Merging Strategy)**\n", "The way clusters are merged in agglomerative clustering depends on the **linkage criterion**:\n", "\n", "- **Ward’s Method (Variance Minimization)** \n", " - Minimizes the **sum of squared differences** within clusters.\n", " - Produces compact, spherical clusters (similar to K-Means).\n", " - Generally preferred for balanced, well-separated clusters.\n", "\n", "- **Complete Linkage (Maximum Distance)** \n", " - Merges clusters that have the **smallest maximum pairwise distance** between points.\n", " - Tends to produce **compact** and **globular** clusters.\n", "\n", "- **Average Linkage (Mean Distance)** \n", " - Merges clusters with the **smallest average pairwise distance**.\n", " - A compromise between Ward’s and complete linkage.\n", "\n", "- **Single Linkage (Minimum Distance)** \n", " - Merges clusters with the **smallest minimum pairwise distance**.\n", " - Can lead to **chained clusters**, where points are loosely linked.\n", "\n", "### **Pros & Cons of Hierarchical Clustering**\n", "- **No need to specify `k`** beforehand. \n", "- **Dendrogram provides insights** into cluster relationships. \n", "- **Works well for small to medium datasets**. \n", "- **Computationally expensive**: O(n²) to O(n³) complexity. \n", "- **Sensitive to noise & outliers** (especially with single linkage). \n", "\n", "Hierarchical clustering can be optimized with a **connectivity matrix**, which restricts which points can be merged, reducing complexity.\n", "\n", "For more details, see scikit-learn’s [User Guide](http://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering). \n", "Animation is from ProgrammerSought." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import AgglomerativeClustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- complete" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " AgglomerativeClustering,\n", " preprocess=StandardScaler(),\n", " n_clusters=3,\n", " linkage='complete',\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import kneighbors_graph" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cluster_connections(**params):\n", " results = {}\n", " Xs = {}\n", " models = {}\n", " for problem, dataset in datasets.items():\n", " X, y = dataset\n", " X = StandardScaler().fit_transform(X, y)\n", " Xs[problem] = X\n", " connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)\n", " connectivity = 0.5 * (connectivity + connectivity.T)\n", " model = AgglomerativeClustering(connectivity=connectivity, **params)\n", " model.fit(X)\n", " results[problem] = model.labels_.astype('int')\n", " models[problem] = model\n", " return models, Xs, results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- ward" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "models, Xs, results = cluster_connections(\n", " linkage='ward',\n", " n_clusters=2,\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- average" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "models, Xs, results = cluster_connections(\n", " linkage=\"average\",\n", " n_clusters=2,\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Generating dendrograms\n", "\n", "1. 
Generate small dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_dummy, y_dummy = make_blobs(n_samples=10, n_features=2, random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.scatter(X_dummy[:, 0], X_dummy[:, 1], c=y_dummy)\n", "for i in range(len(y_dummy)):\n", " ax.annotate(i, (X_dummy[i, 0], X_dummy[i, 1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Use scipy's method to generat dendrogram" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.cluster.hierarchy import dendrogram, linkage\n", "Z = linkage(X_dummy)\n", "dendrogram(Z);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. [Spectral Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html)\n", "\n", "Spectral Clustering is a graph-based clustering algorithm that uses the eigenvalues of a similarity matrix to perform dimensionality reduction before applying a clustering method like K-Means. Unlike traditional clustering methods that rely on geometric distances (e.g., K-Means), Spectral Clustering is particularly effective for identifying non-convex and arbitrarily shaped clusters.\n", "\n", "### How it works:\n", "1. Construct a similarity graph from the dataset, where nodes represent data points and edges represent pairwise similarities.\n", "2. Compute the __graph [Laplacian](](https://en.wikipedia.org/wiki/Laplacian_matrix))__ (matrix representation of a graph), which encodes the structure of the graph.\n", "3. Compute the __eigenvectors__ of the Laplacian and use them to embed the data in a lower-dimensional space.\n", "4. 
Apply K-Means clustering in this new space to assign data points to clusters.\n", "\n", "### Advantages:\n", "- Can capture complex cluster structures that are not linearly separable.\n", "- Works well when the data has a natural graph structure, such as social networks or image segmentation.\n", "\n", "### Limitations:\n", "- Computationally expensive for large datasets due to eigenvalue decomposition.\n", "- Requires prior knowledge of the number of clusters.\n", "- Sensitive to the choice of similarity function.\n", "\n", "Spectral Clustering is widely used in applications such as image segmentation, community detection in social networks, and bioinformatics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import SpectralClustering" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " SpectralClustering,\n", " preprocess=StandardScaler(),\n", " n_clusters=2,\n", " gamma=1e1,\n", " random_state=42\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. [Gaussian Mixture Models](http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html)\n", "\n", "\n", "\n", "
\n", "\n", "Gaussian Mixture Models (GMM) are a probabilistic clustering method that models the data as a mixture of multiple Gaussian distributions. Unlike K-Means, which assigns each point to a single cluster, GMM provides a __soft clustering__ approach, where each point is assigned a probability of belonging to multiple clusters. \n", "\n", "### How it works:\n", "1. __Assumption__: The data is generated from a mixture of multiple Gaussian distributions, each defined by a mean and covariance matrix.\n", "2. __Expectation-Maximization (EM) Algorithm__: \n", " - __Expectation step (E-step)__: Computes the probability that each data point belongs to each cluster. \n", " - __Maximization step (M-step)__: Updates the parameters (means, covariances, and mixture weights) of the Gaussians to maximize the likelihood of the data.\n", "3. The process iterates until convergence, refining cluster assignments.\n", "\n", "### Advantages:\n", "- Can model __elliptical clusters__ with different sizes and orientations, unlike K-Means, which assumes spherical clusters.\n", "- Provides probabilistic cluster assignments, making it more flexible in uncertain cases.\n", "- Can handle overlapping clusters better than hard clustering techniques.\n", "\n", "### Limitations:\n", "- Requires selecting the number of components (clusters) in advance.\n", "- Can suffer from local optima, so multiple initializations may be needed.\n", "- Computationally more expensive than K-Means due to iterative probability calculations.\n", "\n", "GMM is widely used in applications such as speaker recognition, image segmentation, anomaly detection, and density estimation.\n", "\n", "A nice tutorial on clustering the iris dataset with GMM can be found [here](http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py). \n", "\n", "Image from [Wikipedia](https://en.wikipedia.org/wiki/Mixture_model)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.mixture import GaussianMixture " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " GaussianMixture,\n", " preprocess=StandardScaler(),\n", " n_components=3,\n", " random_state=42\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Replicate [sklearn guide's example](https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html#sphx-glr-auto-examples-mixture-plot-gmm-py)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Cluster Validation](http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation)\n", "\n", "Evaluating the performance of a clustering algorithm is more complex than evaluating supervised learning models. Since clustering is an unsupervised task, there is no direct notion of \"correct\" clusters. Instead, evaluation metrics focus on assessing whether the clusters effectively capture the structure of the data.\n", "\n", "In particular, cluster evaluation should:\n", "- Be invariant to the absolute values of cluster labels (e.g., swapping cluster labels should not change the score).\n", "- Measure whether the clustering reflects meaningful separations in the data.\n", "- Determine whether data points within the same cluster are more similar than points in different clusters, based on a chosen similarity metric.\n", "\n", "Cluster validation approaches fall into two broad categories:\n", "\n", "### 1. Evaluation with Ground Truth \n", "When labeled data is available, we can compare the clustering results against known class labels to assess performance. 
Common evaluation metrics include:\n", "\n", "- **[Mutual Information-Based Scores](http://scikit-learn.org/stable/modules/clustering.html#mutual-information-based-scores)** \n", " Mutual Information (MI) measures the agreement between predicted cluster labels (`labels_pred`) and true class labels (`labels_true`), ignoring label permutations. Higher MI indicates better clustering performance.\n", "\n", "- **[Homogeneity, Completeness, and V-Measure](https://scikit-learn.org/stable/modules/clustering.html#homogeneity-completeness-and-v-measure)** \n", " These metrics evaluate clustering quality using conditional entropy:\n", " - **Homogeneity**: Each cluster should contain only members of a single class.\n", " - **Completeness**: All members of a given class should be assigned to the same cluster.\n", " - **V-Measure**: The harmonic mean of homogeneity and completeness, providing a balanced evaluation.\n", "\n", "### 2. Evaluation Without Ground Truth \n", "When no labeled data is available, clustering quality is assessed using internal metrics that measure the compactness and separation of clusters.\n", "\n", "- **[Silhouette Coefficient](http://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient)** \n", " This metric quantifies how similar a sample is to its own cluster compared to other clusters. 
It is defined for each sample as:\n", "\n", " $$\n", " s = \\frac{b - a}{\\max(a, b)}\n", " $$\n", "\n", " where:\n", " - \\( a \\) = The mean distance between a sample and all other points in the same cluster.\n", " - \\( b \\) = The mean distance between a sample and the nearest neighboring cluster.\n", "\n", " The silhouette score ranges from -1 (poor clustering) to 1 (well-clustered data), with values around 0 indicating overlapping clusters.\n", "\n", "Other internal evaluation metrics include the **Calinski-Harabasz Index** and **Davies-Bouldin Index**, which measure cluster compactness and separation.\n", "\n", "By combining multiple validation techniques, we can better assess clustering performance and choose the most suitable algorithm for a given dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import silhouette_score" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for problem in datasets.keys():\n", " print(problem, silhouette_score(Xs[problem], results[problem], random_state=42))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Clustering movies\n", "\n", "Cluster the [movielens dataset](https://grouplens.org/datasets/movielens/latest/)!\n", "\n", "1. Download and extract the dataset (from [here](https://grouplens.org/datasets/movielens/latest/))\n", "2. Read the readme from the archive\n", "3. 
Use the datafiles to cluster the movies!\n", "\n", "Hints:\n", "- in movies.csv:\n", " - movie genres can be extracted from genres column\n", " - premiere year can be extracted from the title column (eg: using `r'\\((\\d+)\\)$'` regex)\n", "- re module is your friend (pandas already accepts regexes in str.replace() and str.extract() methods)\n", "- use the preprocessed file from `data/movielens.csv`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "szisz_ds_2025", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 1 }