{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [ "s1", "content", "l1" ] }, "source": [ "# Agglomerative Clustering\n", "\n", "## Agglomerative Clustering\n", "\n", "### Algorithm\n", "\n", "This is a type of hierarchical clustering. First pairs of data points which are similar to each other are grouped together to form clusters. Then iteratively, the clusters are grouped together based on nearness criterion. This method is followed to form a heirarchy or a tree of clusters till we get back all the data merged to form a root. A cut can be taken at any point to visualize the groupings or cluster associations of each data point. \n", "\n", "### Using Scipy\n", "\n", "In order to perform agglomerative clustering, we need to import two functions from the scipy.cluster.hierarchy library - dendrogram and linkage. The linkage function is the main function which calculates the distances between every combination of data points within the given data set, using the specified method and metric (Eg. linkage(data_set, method_of_distance_measurement, metric)). These distances so calculated are to be stored in a matrix. This function automatically classifies the data points into clusters with each iteration, merging data points/clusters which are closest to each other.\n", "\n", "```python\n", "from scipy.cluster.hierarchy import dendrogram, linkage\n", "\n", "clust_dist = linkage(moon_df, 'average', 'euclidean')\n", "```\n", "\n", "### Dendogram\n", "\n", "A dendrogram is a tree like structure which displays the clusters formed at each step. To visulize the clusters at any step, you need to take a cut. It provides a lot of information in terms of which data points were merged at that cut. A horizontal line in the dendrogram denotes the occurence of that cut and a merge. The vertical lines dropping from the ends of the horizontal line, point to the clusters (or data points) which were merged. The length of the vertical line (also referred to as height of the horizontal line) denotes the distance covered from the cluster. In subsequent iterations/merges, we can observe that the heights are different, because the distance between one cluster and the resulting cluster may not be the same.\n", "\n", "\n", "
\n", "## Exercise:\n", "\n", "Use the dendrogram function to show the agglomerative clustering performed on the dataset.\n", "\n", "```python\n", "DF = dendrogram(\n", " clust_dist,\n", " leaf_rotation=90., # rotates the x axis labels\n", " leaf_font_size=8., # font size for the x axis labels\n", ")\n", "```\n", "\n", "\n", "\n", "The dendrogram shown can be truncated to show the last few merged clusters. \n", "\n", "- Truncate the dendogram to show last 12 merged clusters. Use the parameters:\n", " - truncate_mode='lastp', # show only the last p merged cluster\n", " - p=12, # show only the last p merged clusters\n", " - show_leaf_counts=True, # otherwise numbers in brackets are counts" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "s1", "ce", "l1" ] }, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "from sklearn.datasets import make_moons\n", "from scipy.cluster.hierarchy import dendrogram, linkage\n", "\n", "import numpy as np\n", "import seaborn as sns\n", "import pandas as pd\n", "\n", "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", "plt.rcParams['image.interpolation'] = 'nearest'\n", "plt.rcParams['image.cmap'] = 'gray'\n", "\n", "# defining N, D and K\n", "\n", "N_Samples = 1000\n", "D = 2 \n", "K = 4\n", "\n", "X, y = make_moons(n_samples = 2*N_Samples, noise=0.05, shuffle = False)\n", "x_vec, y_vec = make_moons(n_samples = 2*N_Samples, noise=0.08, shuffle = False)\n", "x_vec[:,0] += 4\n", "y_vec += 4\n", "X = np.concatenate((X, x_vec), axis=0)\n", "y = np.concatenate((y, y_vec), axis=0)\n", "\n", "# converting the data set into a data frame and displaying the plot\n", "moon_df = pd.DataFrame({'X_0':X[:,0],'X_1':X[:,1], 'y':y})\n", "g = sns.pairplot(x_vars=\"X_0\", y_vars=\"X_1\", hue=\"y\", data = moon_df)\n", "g.fig.set_size_inches(14, 6)\n", "sns.despine()\n", "\n", "# Calculating linkages among data points\n", "clust_dist = linkage(moon_df, 'average', 'euclidean')\n", "\n", "# Modify the code below \n", "DF = dendrogram(\n", "clust_dist,\n", "leaf_rotation=90., # rotates the x axis labels\n", "leaf_font_size=8., # font size for the x axis labels\n", ")\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "s1", "l1", "hint" ] }, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "s1", "l1", "ans" ] }, "outputs": [], "source": [ "DF = dendrogram(\r\n", "clust_dist,\r\n", "truncate_mode='lastp', # show only the last p merged clusters\r\n", "p=12, # show only the last p merged clusters\r\n", "show_leaf_counts=True, # otherwise numbers in brackets are counts\r\n", "leaf_rotation=90.,\r\n", "leaf_font_size=12.,\r\n", "show_contracted=True, # to get a distribution impression in truncated branches\r\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "s1", "hid", "l1" ] }, "outputs": [], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " DF2 = dendrogram(\r\n", " clust_dist,\r\n", " truncate_mode='lastp', # show only the last p merged clusters\r\n", " p=12, # show only the last p merged clusters\r\n", " show_leaf_counts=True, # otherwise numbers in brackets are counts\r\n", " leaf_rotation=90.,\r\n", " leaf_font_size=12.,\r\n", " show_contracted=True, # to get a distribution impression in truncated branches\r\n", " )\n", " \n", " import numpy as np\r\n", " \r\n", " if DF['icoord'] == DF2['icoord']:\r\n", " ref_assert_var = True\r\n", " else:\r\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l2", "content", "s2" ] }, "source": [ "\n", "


\n", "## The Cut - Agglomerative Clustering\n", "\n", "In the above dendrogram we can see that we truncated the agglomerative clustering iterations at a point where we can see that all data points were merged to formed 12 different clusters. We can now look up the clusters using 'fcluster' method from \"scipy.cluster.heirarchy\" library. To visualize what data points are associated with each cluster, import the library:\n", "\n", "```python\n", "from scipy.cluster.hierarchy import fcluster\n", "k = 12\n", "labels = fcluster(clustdist, k, criterion='maxclust')\n", "```\n", "\n", "To visualize the labels and verify the length,\n", "\n", "```python\n", "labels, len(labels)\n", "(array([11, 11, 11, ..., 4, 4, 4], dtype=int32), 4000)\n", "```\n", "\n", "
\n", "## Exercise:\n", "\n", " - Visualize the cluster and assign the cluster labels to the column 'agglo'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "l2", "ce", "s2" ] }, "outputs": [], "source": [ "from scipy.cluster.hierarchy import fcluster\n", "from scipy.cluster.hierarchy import dendrogram, linkage\n", "\n", "# Calculating linkages among data points\n", "clust_dist = linkage(moon_df, 'average', 'euclidean')\n", "\n", "k = 4\n", "labels = fcluster(clust_dist, k, criterion='maxclust')\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l2", "s2", "hint" ] }, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "l2", "s2", "ans" ] }, "outputs": [], "source": [ "moon_df['agglo'] = labels\r\n", "g = sns.pairplot(x_vars=\"X_0\", y_vars=\"X_1\", hue=\"agglo\", data = moon_df)\r\n", "g.fig.set_size_inches(14, 6)\r\n", "sns.despine()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "l2", "hid", "s2" ] }, "outputs": [], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " import numpy as np\n", " \n", " if np.all(moon_df['agglo'] == labels):\r\n", " ref_assert_var = True\r\n", " out = g\r\n", " else:\r\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] } ], "metadata": { "executed_sections": [], "rf_version": 1 }, "nbformat": 4, "nbformat_minor": 2 }