{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intro to Data Science\n", "## Part VI. - Model Evaluation, Hyperparameter optimization, Clustering\n", "\n", "### Table of contents\n", "\n", "- #### Model evaluation\n", " - Theory\n", " - Classification Metrics\n", " - Accuracy\n", " - Confusion matrix\n", " - Precision, Recall, F1 score\n", " - ROC curve\n", " - Regression Metrics\n", " - Explained variance\n", " - MAE\n", " - MSE\n", " - Clustering Metrics\n", " \n", "- #### Hyperparameter optimization\n", " - Theory\n", " - Cross Validation\n", " - Grid Search Cross Validation\n", " - Randomized Search Cross Validation\n", " - Other Hyperparameter searching methods\n", " \n", "- #### Clustering\n", " - Theory\n", " - Clustering methods\n", " - K-means\n", " - DBSCAN\n", " - Hierarchical clustering\n", " - Spectral clustering\n", " - Gaussian Mixture Models\n", " \n", "---\n", "\n", "# I. Model Evaluation\n", "\n", "## What is Model Evaluation?\n", "\n", "Working with machine learning algorithms, data mining techniques or simple statistical models requires a proper indication if a model is trained properly. Depending on the task, there are several metrics available to estimate the goodness of a fitted model. Beside raw metrics there are some important concepts on how to compare models. Some of them shows if a model is overfitted, some can show if a model is simpler than the other.\n", "\n", "## Why is it important?\n", "\n", "To find the optimal solution for a given problem it is crucial to have an insight about the performance of the proposed models to decide whether to continue the training process, try a different preprocessing pipeline, or a different model.\n", "\n", "## Tools\n", "- Classification metrics\n", " - accuracy\n", " - precision\n", " - recall\n", " - precision-recall curve\n", " - f1 score\n", " - confusion matrix\n", " - ROC\n", "- Regression metrics\n", " - mean absolute error\n", " - RMSE\n", " - explained variance score\n", "- Clustering metrics\n", "- Cross Validiation\n", "- etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.datasets import load_digits\n", "from sklearn.model_selection import train_test_split\n", "\n", "from sklearn.pipeline import Pipeline\n", "\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "np.random.seed = 42" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_dig, y_dig = load_digits(return_X_y=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X_dig, y_dig,\n", " test_size=.25,\n", " random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nn_pipe = Pipeline([('nn', MLPClassifier(hidden_layer_sizes=(5,), random_state=42))])\n", "nn_pipe.fit(X_train, y_train)\n", "y_hat = nn_pipe.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Classification metrics\n", "\n", "### a) [Accuracy](http://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score)\n", "Accuracy is the ratio of correct predictions, defined by\n", "$$\\texttt{accuracy}(y, \\hat{y}) = \\frac{1}{n_\\text{samples}} \\sum_{i=0}^{n_\\text{samples}-1} 1(\\hat{y}_i = y_i)$$\n", "where $1(x)$ is the indicator function." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accuracy_score(y_test, y_hat)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### b) Confusion matrix\n", "\n", "A [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix#Table_of_confusion) shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data.\n", "\n", "![Type I and II errors](./pics/confusion_matrix_explained.png) \n", "via [@jimgthornton](https://twitter.com/jimgthornton)\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "confusion_matrix(y_test, y_hat)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(confusion_matrix(y_test, y_hat))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### c) Precision, recall, f1 score\n", "\n", "_\"Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples.\"_ from [sklearn docs](http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures)\n", "\n", "Precision, recall and F1 score use the notions of `true positive`, `true negative`, `false positive` and `false negative`. These values represent the four possible outcomes of a binary classification problem and can easily be understood through the following confusion matrix:\n", "\n", "| Confusion Matrix    | Target: Positive   | Target: Negative   |\n", "| ------------------- | ------------------ | ------------------ |\n", "| __Model: Positive__ | a (true positive)  | b (false positive) |\n", "| __Model: Negative__ | c (false negative) | d (true negative)  |\n", "\n", "- __[Precision](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score):__\n", "The precision is the ratio `tp / (tp + fp)` where `tp` is the number of true positives and `fp` the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.\n", "\n", "- __[Recall](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score):__\n", "The recall is the ratio `tp / (tp + fn)` where `tp` is the number of true positives and `fn` the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.\n", "\n", "- __[F1 score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score):__\n", "The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0. The relative contributions of precision and recall to the F1 score are equal. The formula for the F1 score is:\n", "`F1 = 2 * (precision * recall) / (precision + recall)`\n", "In the multi-class and multi-label case, this is the weighted average of the F1 score of each class.\n", "\n", "These metrics are designed for binary classification problems; however, there are several strategies to compute multiclass variants as well. One can either compute the mean of the per-class precision/recall/F1 scores (`average='macro'`), or pool the `tp`, `tn`, `fp`, `fn` counts across classes before computing the metrics (`average='micro'`). You can read about the different strategies [here](http://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification); a short comparison follows after the per-class scores below." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import precision_score\n", "precision_score(y_test, y_hat, average=None)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import recall_score\n", "recall_score(y_test, y_hat, average=None)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import f1_score\n", "f1_score(y_test, y_hat, average=None)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### d) [ROC curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html)\n", "\n", "_“A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity (the true negative rate).”_ from [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n", "\n", "The most important value to extract from this graph is the __area under the curve (AUC)__ value, which _\"is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one\"_ from the [wiki](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_curve\n", "from sklearn.metrics import auc" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Since ROC can only be interpreted for binary classification, we have to transform our multiclass data into the required format. \n", "For this we have to:\n", "\n", "- generate prediction probabilities for each class" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_score = nn_pipe.predict_proba(X_test)\n", "classes = np.unique(y_dig)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "- transform labels into binary classes (1 for the current class, 0 otherwise)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def onevsrest(array, label):\n", "    return (array == label).astype(int)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "- compute __TPR__, __FPR__ and __AUC__ for each class\n", "> `roc_curve` returns the __FPR__, __TPR__ and the __threshold__ arrays" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fpr, tpr, thres, roc_auc = {}, {}, {}, {}\n", "for i in classes:\n", "    fpr[i], tpr[i], thres[i] = roc_curve(onevsrest(y_test, i), y_score[:, i])\n", "    roc_auc[i] = auc(fpr[i], tpr[i])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "- plot all of the curves into one axis" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mycolors = sns.color_palette('muted', n_colors=len(classes))\n", "fig, ax = plt.subplots(figsize=(8, 8))\n", "\n", "ax.set_xlim([0.0, 1.0])\n", "ax.set_ylim([0.0, 1.05])\n", "ax.set_xlabel('False Positive Rate')\n", "ax.set_ylabel('True Positive Rate')\n", "\n", "# plot ROC for random baseline classifier\n", "ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')\n", "\n", "# plot ROC for each class\n", "for cls in classes:\n", "    label = f'ROC curve for {cls} (area = {roc_auc[cls]:0.2f})'\n", "    ax.plot(fpr[cls], tpr[cls], color=mycolors[cls], lw=2, label=label)\n", "\n", "# a single legend call after the loop is enough\n", "ax.legend(loc=\"lower right\")" ] },
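{ "cell_type": "markdown", "metadata": {}, "source": [ "scikit-learn can also compute AUC values directly with `roc_auc_score`; a minimal sketch for the multiclass case using the one-vs-rest strategy, which should agree with the per-class values above up to (macro) averaging:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_auc_score\n", "\n", "# macro-averaged one-vs-rest AUC over all 10 classes\n", "roc_auc_score(y_test, y_score, multi_class='ovr')" ] },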
Regression metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg_pipe = Pipeline([('reg', LogisticRegression(solver='liblinear', multi_class='auto', random_state=42))])\n", "reg_pipe.fit(X_train, y_train)\n", "y_hat = reg_pipe.predict(X_test)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### a) [Explained variance score](http://scikit-learn.org/stable/modules/model_evaluation.html#explained-variance-score)\n", "\n", "If $\\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output, and $Var$ is Variance, the square of the standard deviation, then the explained variance is estimated as follow:\n", "$$explained\\_variance(y, \\hat{y}) = 1 - \\frac{Var\\{ y - \\hat{y}\\}}{Var\\{y\\}}$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import explained_variance_score\n", "explained_variance_score(y_test, y_hat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b) [Mean absolute error (`MAE`)](http://scikit-learn.org/stable/modules/model_evaluation.html#mean-absolute-error)\n", "\n", "MAE is a risk metric corresponding to the expected value of the absolute error loss or $l1$-norm loss. \n", "If $\\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean absolute error (MAE) estimated over $n_{\\text{samples}}$ is defined as\n", "$$\\text{MAE}(y, \\hat{y}) = \\frac{1}{n_{\\text{samples}}} \\sum_{i=0}^{n_{\\text{samples}}-1} \\left| y_i - \\hat{y}_i \\right|.$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_absolute_error\n", "mean_absolute_error(y_test, y_hat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c) [Mean squared error (`MSE`)](http://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error)\n", "\n", "MSE is a risk metric corresponding to the expected value of the squared (quadratic) error loss or loss. \n", "If $\\hat{y}_i$ is the predicted value of the i-th sample, and $y_i$ is the corresponding true value, then the mean squared error (MSE) estimated over $n_{\\text{samples}}$ is defined as\n", "$$\\text{MSE}(y, \\hat{y}) = \\frac{1}{n_\\text{samples}} \\sum_{i=0}^{n_\\text{samples} - 1} (y_i - \\hat{y}_i)^2.$$\n", "\n", "It's widely used variant, the __Root Mean Squared Error (`RMSE`)__ is computed by getting the root of MSE." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error\n", "mean_squared_error(y_test, y_hat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# II. Hyperparameter optimization\n", "\n", "## What is Hyperparameter optimization?\n", "\n", "According to [wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_optimization): \n", "_\"In the context of machine learning, __hyperparameter optimization__ or __model selection__ is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set. Often cross-validation is used to estimate this generalization performance. \n", "Hyperparameter optimization contrasts with actual learning problems, which are also often cast as optimization problems, but optimize a loss function on the training set alone. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# II. Hyperparameter optimization\n", "\n", "## What is Hyperparameter optimization?\n", "\n", "According to [wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_optimization):  \n", "_\"In the context of machine learning, __hyperparameter optimization__ or __model selection__ is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set. Often cross-validation is used to estimate this generalization performance.  \n", "Hyperparameter optimization contrasts with actual learning problems, which are also often cast as optimization problems, but optimize a loss function on the training set alone. In effect, learning algorithms learn parameters that model/reconstruct their inputs well, while hyperparameter optimization is to ensure the model does not overfit its data by tuning, e.g., regularization.\"_\n", "\n", "## Why is it important?\n", "\n", "To find the optimal solution to a given problem, one must train several models with similar predictive/exploratory power and select the simplest one. This process includes selecting models and finding their optimal hyperparameters, which is time-consuming and tedious work when done by hand. We use automated solutions to overcome this problem, save time, and achieve better results.\n", "\n", "## Tools\n", "- Grid search\n", "- Randomized search\n", "- Bayesian search\n", "- Gradient-based optimization\n", "- TPOT\n", "- etc.\n", "\n", "---\n", "\n", "## Cross Validation\n", "\n", "In order to select an optimal model, one must first be able to measure a model's/pipeline's accuracy. \n", "\n", "First, one must select a valid metric for the model. In sklearn, the default validation metric is the accuracy score for classification and $r^{2}$ for regression, although several other metrics are available in `sklearn.metrics` as well.\n", "\n", "_\"Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.\"_ [1] To overcome this problem, one must split the data into a __training__ and a __test__ dataset: train the model on the training set, then measure its performance on the test set.\n", "\n", "However, different splits can produce different outcomes, so this process must be repeated several times to give a good approximation of the examined model's accuracy. This process is called __Cross Validation__, and there are different strategies to make these splits.\n", "\n", "Even a simple model can yield different solutions on the same data depending on its hyperparameters, so multiple models must be trained to select the ideal hyperparameter settings. Cross Validation gives a good approximation of a trained model's accuracy, but additional methods are required to select the ideal hyperparameters; a baseline sketch follows below.\n", "\n", "[1] Scikit-learn User Guide" ] },
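{ "cell_type": "markdown", "metadata": {}, "source": [ "As a baseline for the search methods below, a minimal sketch of cross-validating a single, fixed model with `cross_val_score` (5 splits, default scoring):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", "# accuracy of the fixed logistic regression pipeline on 5 different splits\n", "scores = cross_val_score(reg_pipe, X_dig, y_dig, cv=5)\n", "scores, scores.mean()" ] },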
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.model_selection import GridSearchCV" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe_digit = Pipeline([\n", " ('pca', PCA(svd_solver='randomized', random_state=42)),\n", " ('logistic', LogisticRegression(solver='liblinear', multi_class='auto', random_state=42))\n", "])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param_grid = {\n", " 'pca__n_components': [20, 40, 64],\n", " 'logistic__C': np.logspace(-4, 4, 3)\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_search = GridSearchCV(estimator=pipe_digit, \n", " param_grid=param_grid,\n", " n_jobs=-1,\n", " cv=5,\n", " verbose=1, \n", " return_train_score=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "grid_search.fit(X_dig, y_dig)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_search.best_estimator_.get_params(deep=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_search.best_params_, grid_search.best_score_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_search.cv_results_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "score_dict = grid_search.cv_results_\n", "hmap = pd.DataFrame({\n", " 'mean': score_dict['mean_test_score'],\n", " 'C': [param['logistic__C'] for param in score_dict['params']],\n", " 'n': [param['pca__n_components'] for param in score_dict['params']]\n", "})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(hmap.pivot(index='C', columns='n', values='mean'), annot=True, fmt='.3f');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Randomized Search Cross Validation\n", "Randomized search randomly generate a fixed number of hyperparameter setups. It selects the parameters from the provided parameter parameter ranges and then measures them with cross validation." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import RandomizedSearchCV" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random_search_digit = RandomizedSearchCV(\n", " pipe_digit,\n", " {\n", " 'pca__n_components': np.linspace(1, 64, 64, dtype=int),\n", " 'logistic__C': np.logspace(-4, 4, 30),\n", " },\n", " n_iter=30, \n", " n_jobs=-1,\n", " cv=5,\n", " verbose=1,\n", " return_train_score=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "random_search_digit.fit(X_dig, y_dig)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random_score_dict = random_search_digit.cv_results_\n", "hmap_r = pd.DataFrame({\n", " 'mean': random_score_dict['mean_test_score'],\n", " 'C': [param['logistic__C'] for param in random_score_dict['params']],\n", " 'n': [param['pca__n_components'] for param in random_score_dict['params']]\n", "})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random_search_digit.best_params_, random_search_digit.best_score_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(12,10))\n", "sns.heatmap(hmap_r.pivot(index='C', columns='n', values='mean'), annot=True, ax=ax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Other Hyperparameter searching methods\n", "\n", "- TPOT\n", "- auto-sklearn\n", "- hyperopt\n", "- Other hyperparameter searching methods \n", "\n", "---\n", "# III. Clustering\n", "\n", "## What is Clustering?\n", "Clustering is an unsupervised machine learning problem. _\"Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution.\"_ from: Wiki \n", "\n", "_\"Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).\"_ from: Wiki\n", "\n", "\n", "## Why is it important?\n", "Often the data does not contain target variables so one must find the hidden structure in the data first in order to achieve his/her goals. In case of recommender systems, it is a common technique to group the similar items together. In some cases the task itself is to find similar/connected/related items in the data. Like in image processing, Social network analysis, medical analysis, or it can be used to find the anomalies in the data.\n", "\n", "## Tools\n", "- K-Means\n", "- Affinity propagation\n", "- Mean-shift\n", "- Spectral clustering\n", "- Ward hierarchical clustering\n", "- Agglomerative clustering\n", "- DBSCAN\n", "- Gaussian mixtures\n", "- Birch\n", "- Support Vector Clustering\n", "- etc." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_circles\n", "from sklearn.datasets import make_moons\n", "from sklearn.datasets import make_blobs\n", "from sklearn.datasets import load_iris" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_clusters = 3\n", "n_samples = 1500\n", "\n", "iris = load_iris(return_X_y=True)\n", "noisy_circles = make_circles(n_samples=n_samples, factor=.5, noise=.05, random_state=42)\n", "noisy_moons = make_moons(n_samples=n_samples, noise=.05, random_state=42)\n", "blobs = make_blobs(n_samples=n_samples, random_state=42)\n", "no_structure = np.random.rand(n_samples, 2), None\n", "\n", "datasets = {\n", " 'iris': iris,\n", " 'noisy_circles': noisy_circles,\n", " 'noisy_moons': noisy_moons,\n", " 'blobs': blobs,\n", " 'no_structure': no_structure\n", "}\n", "\n", "colors = np.array(sns.color_palette('muted', n_colors=10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cluster_datasets(model, preprocess=None, **params):\n", " model = model(**params)\n", " results = {}\n", " Xs = {}\n", " for problem, dataset in datasets.items():\n", " X, y = dataset\n", " if preprocess:\n", " X = preprocess.fit_transform(X, y)\n", " Xs[problem] = X\n", " model.fit(X)\n", " if hasattr(model, 'labels_'):\n", " results[problem] = model.labels_.astype('int')\n", " else:\n", " results[problem] = model.predict(X)\n", " return model, Xs, results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot(Xs, results):\n", " plot_num = 1\n", " plt.figure(figsize=(len(datasets) * 4, 4))\n", " for problem, X in Xs.items():\n", " plt.subplot(1, len(datasets), plot_num)\n", " plt.scatter(X[:, 0], X[:, 1], color=colors[results[problem]], edgecolors='k')\n", " plot_num += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. [K-Means](http://scikit-learn.org/stable/modules/clustering.html#k-means)\n", "\n", "\n", "\n", "K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k leading to the greatest separation (distance) is not known as a priori and must be computed from the data. The objective of K-Means clustering is to minimize total intra-cluster variance, or, the squared error function: \n", "\n", "\n", "\n", "
\n", "\n", "__Algorithm__:\n", "1. Clusters the data into k groups where k is predefined.\n", "2. Select k points at random as cluster centers.\n", "3. Assign objects to their closest cluster center according to the Euclidean distance function.\n", "4. Calculate the centroid or mean of all objects in each cluster.\n", "5. Repeat steps 2, 3 and 4 until the same points are assigned to each cluster in consecutive rounds.\n", "\n", "from: Dr. Saed Sayad's __An Introduction to Data Mining__ book. \n", "\n", "Animation is from the Wikipedia." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "from sklearn.cluster import KMeans, MiniBatchKMeans" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " model=KMeans,\n", " preprocess=StandardScaler(),\n", " n_init='auto',\n", " n_clusters=3,\n", " random_state=42\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " MiniBatchKMeans,\n", " preprocess=StandardScaler(),\n", " n_init='auto',\n", " n_clusters=3,\n", " random_state=42\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Cluster the digits dataset!\n", "\n", "Hints:\n", "- read with sklearn's built-in method\n", "- use standard scaling\n", "- use dimension reduction (pca)\n", "- visualize results\n", "\n", "In case you are lost, follow [sklearn's guide](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. [DBSCAN](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)\n", "\n", "\n", "\n", "*\"The __Density-based spatial clustering of applications with noise (DBSCAN)__ algorithm views __clusters__ as __areas of high density separated by areas of low density__. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of __core samples__, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of __non-core samples__ () that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm, `min_samples` and `eps`, which define formally what we mean when we say dense. Higher `min_samples` or lower `eps` indicate higher density necessary to form a cluster.\"* from sklearn's [User Guide](http://scikit-learn.org/stable/modules/clustering.html#dbscan).\n", "\n", "
\n", "\n", "Consider a set of points in some space to be clustered. For the purpose of DBSCAN clustering, the points are classified as __core points__, (density-)__reachable points__ and __outliers__, as follows:\n", "\n", "- A point $p$ is a core point if at least $minPts$ points are within distance $ε$ ($ε$ is the maximum radius of the neighborhood from $p$) of it (including $p$). Those points are said to be directly reachable from $p$. By definition, no points are directly reachable from a non-core point.\n", "- A point $q$ is reachable from $p$ if there is a path $p_1, ..., p_n$ with $p_1 = p$ and $p_n = q$, where each $p_i+1$ is directly reachable from $p_i$ (all the points on the path must be core points, with the possible exception of $q$).\n", "- All points not reachable from any other point are outliers.\n", "\n", "Now if $p$ is a core point, then it forms a cluster together with all points (core or non-core) that are reachable from it. Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its \"edge\", since they cannot be used to reach more points. from [wiki](https://en.wikipedia.org/wiki/DBSCAN#Preliminary)\n", "\n", "Animation is from ProgrammerSought." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import DBSCAN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " DBSCAN,\n", " preprocess=StandardScaler(),\n", " eps=0.3,\n", " min_samples=3\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Cluster the generated dataset with K-Means and DBSCAN!\n", "\n", "1. Data generation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X, y = make_blobs(random_state=170, n_samples=600, centers = 5)\n", "rng = np.random.RandomState(42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transformation = rng.normal(size=(2, 2))\n", "X = np.dot(X, transformation)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(X[:, 0], X[:, 1])\n", "plt.xlabel(\"Feature 0\")\n", "plt.ylabel(\"Feature 1\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Clustering with K-means" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Clustering with DBSCAN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examine and explain the clustering results!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. [Hierarchical Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering)\n", "\n", "\n", "\n", "
\n", "\n", "Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or __dendrogram__). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. See the Wikipedia page for more details.\n", "\n", "The __Agglomerative Clustering__ performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criteria determines the metric used for the merge strategy:\n", "\n", "- __Ward__ minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach. \n", "- __Maximum__ or __complete linkage__ minimizes the maximum distance between observations of pairs of clusters. \n", "- __Average linkage__ minimizes the average of the distances between all observations of pairs of clusters. \n", "\n", "Agglomerative Clustering can also scale to large number of samples when it is used jointly with a connectivity matrix, but is computationally expensive when no connectivity constraints are added between samples: it considers at each step all the possible merges. - from sklearn's [User Guide](http://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering).\n", "\n", "Animation is from ProgrammerSought." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import AgglomerativeClustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- complete" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " AgglomerativeClustering,\n", " preprocess=StandardScaler(),\n", " n_clusters=3,\n", " linkage='complete',\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import kneighbors_graph" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cluster_connections(**params):\n", " results = {}\n", " Xs = {}\n", " models = {}\n", " for problem, dataset in datasets.items():\n", " X, y = dataset\n", " X = StandardScaler().fit_transform(X, y)\n", " Xs[problem] = X\n", " connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)\n", " connectivity = 0.5 * (connectivity + connectivity.T)\n", " model = AgglomerativeClustering(connectivity=connectivity, **params)\n", " model.fit(X)\n", " results[problem] = model.labels_.astype('int')\n", " models[problem] = model\n", " return models, Xs, results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- ward" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "models, Xs, results = cluster_connections(\n", " linkage='ward',\n", " n_clusters=2,\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- average" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "models, Xs, results = cluster_connections(\n", " linkage=\"average\",\n", " n_clusters=2,\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Generating 
dendrograms\n", "\n", "1. Generate small dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_dummy, y_dummy = make_blobs(n_samples=10, n_features=2, random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.scatter(X_dummy[:, 0], X_dummy[:, 1], c=y_dummy)\n", "for i in range(len(y_dummy)):\n", " ax.annotate(i, (X_dummy[i, 0], X_dummy[i, 1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Use scipy's method to generat dendrogram" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.cluster.hierarchy import dendrogram, linkage\n", "Z = linkage(X_dummy)\n", "dendrogram(Z);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. [Spectral Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html)\n", "\n", "_\"Spectral Clustering does a __low-dimension embedding__ of the __affinity matrix__ between samples (similarity matrix), followed by a __K-Means in the low dimensional space__. It is especially efficient if the affinity matrix is sparse. SpectralClustering requires the number of clusters to be specified. It works well for a small number of clusters but is not advised when using many clusters.\"_ - from sklearn's [User Guide](http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import SpectralClustering" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " SpectralClustering,\n", " preprocess=StandardScaler(),\n", " n_clusters=2,\n", " gamma=1e1,\n", " random_state=42\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. [Gaussian Mixture Models](http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html)\n", "\n", "_\"A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.\"_ - from sklearn's [User Guide](http://scikit-learn.org/stable/modules/mixture.html#gaussian-mixture-models)\n", "\n", "A nice tutorial on clustering the iris dataset can be found [here](http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.mixture import GaussianMixture " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model, Xs, results = cluster_datasets(\n", " GaussianMixture,\n", " preprocess=StandardScaler(),\n", " n_components=3,\n", " random_state=42\n", ")\n", "plot(Xs, results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Replicate [sklearn guide's example](https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html#sphx-glr-auto-examples-mixture-plot-gmm-py)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Cluster Validation](http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation)\n", "\n", "Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or the precision and recall of a supervised classification algorithm. In particular any evaluation metric should not take the absolute values of the cluster labels into account but rather if this clustering define separations of the data similar to some ground truth set of classes or satisfying some assumption such that members belong to the same class are more similar that members of different classes according to some similarity metric.\n", "\n", "There are two separate approach to this problem: \n", "\n", "- __Knowing the ground truth__: \n", " To test how good a clustering algorithm generally performs, we can apply it to datasets with known labels. This shows how well it will perform on unknown data.\n", " - [mutual information based scores](http://scikit-learn.org/stable/modules/clustering.html#mutual-information-based-scores): Given the knowledge of the ground truth class assignments `labels_true` and our clustering algorithm assignments of the same samples `labels_pred`, the __Mutual Information__ is a function that measures the agreement of the two assignments, ignoring permutations.\n", " - [homogeneity, completeness and V-measure](https://scikit-learn.org/stable/modules/clustering.html#homogeneity-completeness-and-v-measure): Given the knowledge of the ground truth class assignments of the samples, it is possible to define some intuitive metric using conditional entropy analysis. Two of such measure is:\n", " - __homogeneity__: each cluster contains only members of a single class.\n", " - __completeness__: all members of a given class are assigned to the same cluster\n", " - their harmonic mean called __V-measure__.\n", "\n", "- __Without knowing the ground truth__: \n", " Measuring the generated cluster metrics to determine the goodness of the clustering. One example for such a metric is the:\n", " - [silhouette coefficient](http://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient) which is defined for each sample and is composed of two scores:\n", " \n", " - a: The mean distance between a sample and all other points in the same class.\n", " - b: The mean distance between a sample and all other points in the next nearest cluster. \n", " \n", " The Silhouette Coefficient $s$ for a single sample is then given as:\n", " $s = \\frac{b - a}{max(a, b)}$\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import silhouette_score" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for problem in datasets.keys():\n", " print(problem, silhouette_score(Xs[problem], results[problem], random_state=42))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Clustering movies\n", "\n", "Cluster the [movielens dataset](https://grouplens.org/datasets/movielens/latest/)!\n", "\n", "1. Download and extract the dataset (from [here](https://grouplens.org/datasets/movielens/latest/))\n", "2. Read the readme from the archive\n", "3. 
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Clustering movies\n", "\n", "Cluster the [movielens dataset](https://grouplens.org/datasets/movielens/latest/)!\n", "\n", "1. Download and extract the dataset (from [here](https://grouplens.org/datasets/movielens/latest/))\n", "2. Read the readme from the archive\n", "3. Use the data files to cluster the movies!\n", "\n", "Hints:\n", "- in movies.csv:\n", "    - movie genres can be extracted from the genres column\n", "    - the premiere year can be extracted from the title column (e.g. using the `r'\\((\\d+)\\)$'` regex)\n", "- the re module is your friend (pandas already accepts regexes in the str.replace() and str.extract() methods)\n", "- use the preprocessed file from `data/movielens.csv`" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }
], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "szisz_ds_23", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "vscode": { "interpreter": { "hash": "d0dc57cccaa2d0d305072673e9fe47ca44c23e744f2f05647d6b373471557b60" } } }, "nbformat": 4, "nbformat_minor": 1 }