{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extract Clusters from Python Dendrogram - adjusted for Clustermap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Original File: https://nbviewer.jupyter.org/gist/vals/150ec97a5b7db9c82ee9\n", "Blog Post: http://www.nxn.se/valent/extract-cluster-elements-by-color-in-python" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "from collections import defaultdict\n", "from fastcluster import linkage\n", "from matplotlib.colors import rgb2hex, colorConverter\n", "from matplotlib.pyplot import cm\n", "from scipy.cluster import hierarchy\n", "from scipy.cluster.hierarchy import dendrogram, set_link_color_palette\n", "import matplotlib as mpl\n", "import pandas as pd\n", "import seaborn as sns\n", "%pylab inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the required packages." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Amount_offers
geo_bln
Baden_Württemberg26712
Bayern28563
Berlin5191
Brandenburg11689
Bremen1734
Hamburg2787
Hessen19827
Mecklenburg_Vorpommern5795
Niedersachsen28356
Nordrhein_Westfalen44552
Rheinland_Pfalz20778
Saarland3826
Sachsen12531
Sachsen_Anhalt8175
Schleswig_Holstein11157
Thüringen4589
\n", "
" ], "text/plain": [ " Amount_offers\n", "geo_bln \n", "Baden_Württemberg 26712\n", "Bayern 28563\n", "Berlin 5191\n", "Brandenburg 11689\n", "Bremen 1734\n", "Hamburg 2787\n", "Hessen 19827\n", "Mecklenburg_Vorpommern 5795\n", "Niedersachsen 28356\n", "Nordrhein_Westfalen 44552\n", "Rheinland_Pfalz 20778\n", "Saarland 3826\n", "Sachsen 12531\n", "Sachsen_Anhalt 8175\n", "Schleswig_Holstein 11157\n", "Thüringen 4589" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_full = pd.read_csv('df_full.CSV', engine='python', encoding='utf-8', index_col=0)\n", "grouped = df_full.groupby(['geo_bln'])['obj_purchasePrice'].count()\n", "grouped = pd.DataFrame(grouped) \n", "grouped = grouped.rename(columns={'obj_purchasePrice':'Amount_offers'})\n", "grouped.reset_index(inplace=True)\n", "grouped = grouped.set_index('geo_bln')\n", "grouped" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data load" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "cmap = cm.rainbow(np.linspace(0, 1, 10))\n", "hierarchy.set_link_color_palette([mpl.colors.rgb2hex(rgb[:3]) for rgb in cmap])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets set the colors" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "link = linkage(grouped, method='single')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we create the link within the data, we are using the single method but others are possible too, for example:
\n", " -Complete
\n", " -Average
\n", " -Weighted
\n", " -Centroid" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "class Clusters(dict):\n", " def _repr_html_(self):\n", " html = ''\n", " for c in self:\n", " hx = rgb2hex(colorConverter.to_rgb(c))\n", " html += '' \\\n", " ''\n", " html += ''\n", " \n", " html += '
' \\\n", " ''.format(hx)\n", " html += c + '' \n", " html += repr(self[c]) + ''\n", " html += '
'\n", " \n", " return html" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "def get_cluster_classes(den, label='ivl'):\n", " cluster_idxs = defaultdict(list)\n", " for c, pi in zip(den['color_list'], den['icoord']):\n", " for leg in pi[1:3]:\n", " i = (leg - 5.0) / 10.0\n", " if abs(i - int(i)) < 1e-5:\n", " cluster_idxs[c].append(int(i))\n", " print(cluster_idxs)\n", " cluster_classes = Clusters()\n", " for c, l in cluster_idxs.items():\n", " i_l = [den[label][i] for i in l]\n", " cluster_classes[c] = i_l\n", " \n", " return cluster_classes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets create a function to extract the clusters and display it nicely" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "figsize(8, 7)\n", "den = dendrogram(link, labels=grouped.index, color_threshold=4000)#<-change this to change the depth of the clustering\n", "plt.xticks(rotation=90)\n", "no_spine = {'left': True, 'bottom': True, 'right': True, 'top': True}\n", "sns.despine(**no_spine);\n", "\n", "plt.tight_layout()\n", "plt.savefig('tree1.png');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this formula, we define the dendogram by using the link which was defined before. If you adjust the color_threshhold, you are able to increase/decrease the depth of the clustering. " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "defaultdict(, {'#8000ff': [2, 3, 1, 9, 10, 8, 7, 6, 5, 4], '#4856fb': [11, 12], '#10a2f0': [14, 15, 13], 'C0': [0]})\n" ] }, { "data": { "text/html": [ "
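<table><tr><td style=\"background-color: #8000ff;\"><code>#8000ff</code></td><td><code>['Brandenburg', 'Schleswig_Holstein', 'Sachsen', 'Berlin', 'Thüringen', 'Mecklenburg_Vorpommern', 'Saarland', 'Hamburg', 'Bremen', 'Sachsen_Anhalt']</code></td></tr><tr><td style=\"background-color: #4856fb;\"><code>#4856fb</code></td><td><code>['Hessen', 'Rheinland_Pfalz']</code></td></tr><tr><td style=\"background-color: #10a2f0;\"><code>#10a2f0</code></td><td><code>['Bayern', 'Niedersachsen', 'Baden_Württemberg']</code></td></tr><tr><td style=\"background-color: #1f77b4;\"><code>C0</code></td><td><code>['Nordrhein_Westfalen']</code></td></tr></table>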
" ], "text/plain": [ "{'#8000ff': ['Brandenburg',\n", " 'Schleswig_Holstein',\n", " 'Sachsen',\n", " 'Berlin',\n", " 'Thüringen',\n", " 'Mecklenburg_Vorpommern',\n", " 'Saarland',\n", " 'Hamburg',\n", " 'Bremen',\n", " 'Sachsen_Anhalt'],\n", " '#4856fb': ['Hessen', 'Rheinland_Pfalz'],\n", " '#10a2f0': ['Bayern', 'Niedersachsen', 'Baden_Württemberg'],\n", " 'C0': ['Nordrhein_Westfalen']}" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_cluster_classes(den)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now you can extract the clusters from the HTML body. The clusters are named after the corrosponding colors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 1 }