{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Clustering.ipynb", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Introduction \n", "\n" ], "metadata": { "id": "PNmnb8Nf2G1a" } }, { "cell_type": "markdown", "source": [ "Clustering refers to a very broad set of techniques for finding *subgroups*, or *clusters*, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other.\n", "\n", "For instance, suppose that we have a set of $n$ observations, each with $p$ features. The $n$ observations could correspond to tissue samples for patients with breast cancer, and the $p$ features could correspond to measurements collected for each tissue sample; these could be clinical measurements, such as tumor stage or grade, or they could be gene expression measurements. \n", "\n", "We may have reason to believe that there are a few different *unknown* subtypes of breast cancer. \n", "\n", "Clustering could be used to find these subgroups. This is an unsupervised problem because we are trying to discover structure (in this case, distinct clusters) on the basis of the data set alone. \n", "\n", "Both clustering and PCA seek to simplify the data via a small number of summaries, but their mechanisms are different.\n", "\n", "- PCA looks to find a low-dimensional representation of the observations that explains a good fraction of the variance. \n", "\n", "- Clustering looks to find homogeneous subgroups among the observations." ], "metadata": { "id": "bpWDKqsSfv-g" } }, { "cell_type": "markdown", "source": [ "# K-Means Algorithm\n", "\n", "\n", "\n" ], "metadata": { "id": "8UpHPtXk2R1t" } }, { "cell_type": "markdown", "source": [ "1. Specify number of clusters $K$.\n", "2. 
Initialize centroids by first shuffling the dataset and then randomly selecting $K$ data points for the centroids without replacement.\n", "3. Keep iterating until there is no change to the centroids, i.e., the assignment of data points to clusters is no longer changing:\n", "    - Compute the squared distance between each data point and every centroid.\n", "    - Assign each data point to the closest cluster (centroid).\n", "    - Compute the centroids for the clusters by taking the average of all data points that belong to each cluster." ], "metadata": { "id": "2mVcCYXm4Yxi" } }, { "cell_type": "markdown", "source": [ "# Implementation" ], "metadata": { "id": "InfG7sMa3lO3" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7sNxr1R2079x", "outputId": "1673d685-80f2-4207-966f-89977fdfa87f" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Mounted at /content/drive\n" ] } ], "source": [ "import random\n", "from google.colab import drive\n", "drive.mount('/content/drive')" ] }, { "cell_type": "code", "source": [ "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import random\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "%cd /content/drive/My\\ Drive/colab_notebooks/machine_learning/data/\n", "df = pd.read_csv(\"clustering.csv\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8SbZHVxU1DDB", "outputId": "0e2b49e1-94c8-4cf8-aa8c-0c1760bc7897" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "/content/drive/My Drive/colab_notebooks/machine_learning/data\n" ] } ] }, { "cell_type": "code", "source": [ "df = df[['ApplicantIncome','LoanAmount']]\n", "\n", "y1 = df['ApplicantIncome']\n", "n_bins = 20\n", "plt.hist(y1, bins=n_bins,edgecolor = \"white\")\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 265 }, "id": "a2tv1u591UEU", "outputId": 
"55e359b6-673c-4c1c-e53f-c4916a08476b" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "<Figure size 432x288 with 1 Axes>" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD4CAYAAADiry33AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAP2ElEQVR4nO3db4xldX3H8fenu4tr0bIg080WtLNGouGJQicUojEWqlIwwgNCMEa3FrNJWxutTXSpDxqTPsCm8V/aqBvRbhr/QFELgVRLV0zTJ6tDQQUWyoKgS4AdrKg1oQJ+++D+lo7rLnPn/tmZ+c37ldzcc37nnDvf3z13PnPu755zJ1WFJKlfv7bSBUiSpsugl6TOGfSS1DmDXpI6Z9BLUuc2Hs8fduqpp9bs7Ozx/JGStObddtttj1fVzKjbH9egn52dZX5+/nj+SEla85I8NM72Dt1IUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHr9iiefemZFt5c0Wcf1KxC0NmzetIHZXTePvP2DV188wWokjcsjeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdGyrok2xJcn2Se5LsT3JeklOS3JLkvnZ/8rSLlSQt37BH9B8DvlpVrwBeCewHdgF7q+oMYG+blyStMksGfZKTgNcC1wBU1c+r6gngEmBPW20PcOm0ipQkjW6YI/rtwALw2SS3J/l0khOBrVX1SFvnUWDrtIqUJI1umKDfCJwNfKKqzgJ+xhHDNFVVQB1t4yQ7k8wnmV9YWBi3XknSMg0T9AeBg1W1r81fzyD4H0uyDaDdHzraxlW1u6rmqmpuZmZmEjVLkpZhyaCvqkeBHyR5eWu6ALgbuBHY0dp2ADdMpUJJ0liG/Q9TfwZ8LskJwAPAOxj8kbguyZXAQ8Dl0ylRkjSOoYK+qu4A5o6y6ILJliNJmjSvjJWkzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXMGfaeefOqZlS5B0iox7JWxWmM2b9rA7K6bR9r2wasvnnA1klaSR/SS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0mrhxrsr1il5p8rwyVhPnVbnS6uIRvSR1zqCXpM4Z9JLUOYNekjo31IexSR4Efgo8AzxdVXNJTgGuBWaBB4HLq+pH0ylzfXryqWfYvGnDSpchaY1bzlk3v1dVjy+a3wXsraqrk+xq8++faHXrnGevSJqEcYZuLgH2tOk9wKXjlyNJmrRhg76Af01yW5KdrW1rVT3Sph8Ftk68OknS2IYdunlNVT2c5DeBW5Lcs3hhVVWSOtqG7Q/DToCXvOQlYxUrSVq+oY7oq+rhdn8I+ApwDvBYkm0A7f7QMbbdXVVzVTU3MzMzmaolSUNbMuiTnJjkhYengTcAdwI3AjvaajuAG6ZVpCRpdMMM3WwFvpLk8Pqfr6qvJvkWcF2SK4GHgMunV6YkaVRLBn1VPQC88ijtPwQumEZRkqTJ8cpYSeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXMGvSR1zqCXpM4Z9JLUOYNekjo3dNAn2ZDk9iQ3tfntSfYlOZDk2iQnTK9MSdKolnNE/25g/6L5DwEfqaqXAT8CrpxkYZKkyRgq6JOcDlwMfLrNBzgfuL6tsge4
dBoFSpLGM+wR/UeB9wG/aPMvAp6oqqfb/EHgtKNtmGRnkvkk8wsLC2MVK0laviWDPsmbgENVddsoP6CqdlfVXFXNzczMjPIQkqQxbBxinVcDb05yEbAZ+A3gY8CWJBvbUf3pwMPTK1OSNKolj+ir6qqqOr2qZoErgK9X1VuBW4HL2mo7gBumVqUkaWTjnEf/fuC9SQ4wGLO/ZjIlSZImaZihm2dV1TeAb7TpB4BzJl+SJGmSvDJWkjpn0EtS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9VpUnn3pmRbaVerasrymWpm3zpg3M7rp5pG0fvPriCVcj9cEjeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1Lnlgz6JJuTfDPJt5PcleSDrX17kn1JDiS5NskJ0y9XkrRcwxzR/y9wflW9EngVcGGSc4EPAR+pqpcBPwKunF6ZkqRRLRn0NfA/bXZTuxVwPnB9a98DXDqVCiVJYxlqjD7JhiR3AIeAW4D7gSeq6um2ykHgtGNsuzPJfJL5hYWFSdS8pvhFW8ePX4gmHd1QX2pWVc8Ar0qyBfgK8Iphf0BV7QZ2A8zNzdUoRa5lfknX8eNzLR3dss66qaongFuB84AtSQ7/oTgdeHjCtUmSJmCYs25m2pE8SZ4PvB7YzyDwL2ur7QBumFaRkqTRDTN0sw3Yk2QDgz8M11XVTUnuBr6Y5K+B24FrplinJGlESwZ9VX0HOOso7Q8A50yjKEnS5HhlrCR1zqCXpM4Z9JLUOYNekjpn0EtS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdWzLok7w4ya1J7k5yV5J3t/ZTktyS5L52f/L0y5UkLdcwR/RPA39RVWcC5wJ/muRMYBewt6rOAPa2eUnSKrNk0FfVI1X1n236p8B+4DTgEmBPW20PcOm0ipQkjW5ZY/RJZoGzgH3A1qp6pC16FNh6jG12JplPMr+wsDBGqZKkUQwd9EleAHwJeE9V/WTxsqoqoI62XVXtrqq5qpqbmZkZq1hJ0vINFfRJNjEI+c9V1Zdb82NJtrXl24BD0ylRkjSOYc66CXANsL+qPrxo0Y3Ajja9A7hh8uVJksa1cYh1Xg28Dfhukjta218CVwPXJbkSeAi4fDolSpLGsWTQV9V/ADnG4gsmW44kadK8MlaSOmfQS1LnDHpJ6pxBL0mdM+glqXMGvSR1zqCXpM4Z9JLUOYNekjpn0EtS5wx6SeqcQS9JnTPoJalzBr00piefemZFt5eWMsz30Ut6Dps3bWB2180jb//g1RdPsBrpV3lEL0mdM+glqXMGvSR1zqCXpM4Z9JLUOYNewlMc1TdPr5QY7xRJT4/UaucRvSR1bsmgT/KZJIeS3Lmo7ZQktyS5r92fPN0yJUmjGuaI/h+AC49o2wXsraozgL1tXpK0Ci0Z9FX178B/H9F8CbCnTe8BLp1wXZKkCRl1jH5rVT3Sph8Fth5rxSQ7k8wnmV9YWBjxx0mSRjX2h7FVVUA9x/LdVTVXVXMzMzPj/jhJ0jKNGvSPJdkG0O4PTa4kSdIkjRr0NwI72vQO4IbJlCOtP+NcrOWFXhrGkhdMJfkC8Drg1CQHgb8CrgauS3Il8BBw+TSLlHrmxVqatiWDvqrecoxFF0y4FknSFHhlrCR1zqCXpM4Z9JLUOYNekjpn0EtS5wx6SeqcQS9JnTPopXXKK3LXD/+VoLROeUXu+uERvSR1zqCX1jCHUDQMh26kNczhFw3DI3pJ6pxBvwTfGkta6xy6WcI4b43Bt8eSVp5H9JLUuXUR9A6/SJM17u+Uv5PH17oYuvHMBGmyHNJcW9bFEb0krWdrJuh9qydpJa3l7wZaM0M3Dr9IWklrOYPW
zBG9JGk0Br0kdW6soE9yYZJ7kxxIsmtSRUnSsazlsfKVMvIYfZINwN8DrwcOAt9KcmNV3T2p4iTpSOOOla/VcfZxjHNEfw5woKoeqKqfA18ELplMWZKkSUlVjbZhchlwYVW9s82/DfjdqnrXEevtBHa22ZcD9w7x8KcCj49UWB/s//rt/3ruO9j/Y/X/t6tqZtQHnfrplVW1G9i9nG2SzFfV3JRKWvXs//rt/3ruO9j/afV/nKGbh4EXL5o/vbVJklaRcYL+W8AZSbYnOQG4ArhxMmVJkiZl5KGbqno6ybuArwEbgM9U1V0TqmtZQz0dsv/r13ruO9j/qfR/5A9jJUlrg1fGSlLnDHpJ6tyqC/oev1YhyYuT3Jrk7iR3JXl3az8lyS1J7mv3J7f2JPl4ew6+k+TsRY+1o61/X5IdK9WnUSTZkOT2JDe1+e1J9rV+Xts+1CfJ89r8gbZ8dtFjXNXa703yxpXpyfIl2ZLk+iT3JNmf5Lz1sv+T/Hl73d+Z5AtJNve875N8JsmhJHcuapvYvk7yO0m+27b5eJIsWVRVrZobgw917wdeCpwAfBs4c6XrmkC/tgFnt+kXAv8FnAn8DbCrte8CPtSmLwL+BQhwLrCvtZ8CPNDuT27TJ690/5bxPLwX+DxwU5u/DriiTX8S+OM2/SfAJ9v0FcC1bfrM9pp4HrC9vVY2rHS/huz7HuCdbfoEYMt62P/AacD3gOcv2ud/2PO+B14LnA3cuahtYvsa+GZbN23bP1iyppV+Uo54gs4DvrZo/irgqpWuawr9vIHBdwTdC2xrbduAe9v0p4C3LFr/3rb8LcCnFrX/0nqr+cbgOou9wPnATe1F+jiw8ch9z+BMrvPa9Ma2Xo58PSxebzXfgJNa2OWI9u73fwv6H7TA2tj2/Rt73/fA7BFBP5F93Zbds6j9l9Y71m21Dd0cflEcdrC1daO9FT0L2AdsrapH2qJHga1t+ljPw1p+fj4KvA/4RZt/EfBEVT3d5hf35dl+tuU/buuv1f5vBxaAz7ahq08nOZF1sP+r6mHgb4HvA48w2Je3sX72/WGT2tentekj25/Tagv6riV5AfAl4D1V9ZPFy2rw57nLc12TvAk4VFW3rXQtK2Qjg7fyn6iqs4CfMXj7/qxe938bi76EwR+73wJOBC5c0aJW2Ers69UW9N1+rUKSTQxC/nNV9eXW/FiSbW35NuBQaz/W87BWn59XA29O8iCDbzk9H/gYsCXJ4Yv2Fvfl2X625ScBP2Tt9v8gcLCq9rX56xkE/3rY/78PfK+qFqrqKeDLDF4P62XfHzapff1wmz6y/TmttqDv8msV2qfi1wD7q+rDixbdCBz+NH0Hg7H7w+1vb5/Inwv8uL3t+xrwhiQntyOlN7S2Va2qrqqq06tqlsE+/XpVvRW4FbisrXZk/w8/L5e19au1X9HOzNgOnMHgg6lVraoeBX6Q5OWt6QLgbtbH/v8+cG6SX2+/B4f7vi72/SIT2ddt2U+SnNuez7cveqxjW+kPLY7yIcZFDM5KuR/4wErXM6E+vYbBW7XvAHe020UMxh73AvcB/wac0tYPg3/qcj/wXWBu0WP9EXCg3d6x0n0b4bl4Hf9/1s1LGfyyHgD+CXhea9/c5g+05S9dtP0H2vNyL0OcbbBabsCrgPn2GvhnBmdSrIv9D3wQuAe4E/hHBmfOdLvvgS8w+DziKQbv5q6c5L4G5tpzeT/wdxzxIf/Rbn4FgiR1brUN3UiSJsygl6TOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ37P14RVS/UxfdtAAAAAElFTkSuQmCC\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "code", "source": [ "y1 = df['LoanAmount']\n", "n_bins = 20\n", "plt.hist(y1, bins=n_bins,edgecolor = 
\"white\")\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 265 }, "id": "F2vvGjzG1-f3", "outputId": "2936cf51-9aff-4d5d-811e-a010f62db1a6" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "<Figure size 432x288 with 1 Axes>" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAO9UlEQVR4nO3dbYxcV33H8e+vaxMXaJuEBNeNo65bIlCKSoJWaSJ4QRMeAkEklSIUhKiruvIbUEOLRB2QKiH1hVErHipRWovQWFUKgQCNlaikqQmqKlWBNQ8hiUljgim2nHhpE6Ct3Nrm3xdzN13Wu57x7szOHPv7kUZ777n3av462v3t2XPvnE1VIUlqz8+MuwBJ0soY4JLUKANckhplgEtSowxwSWrUurV8s4suuqimp6fX8i0lqXn79u37QVVdvLh9TQN8enqa2dnZtXxLSWpeku8t1e4UiiQ1aqAReJKDwI+Bk8CJqppJciFwFzANHATeWlXPjKZMSdJiZzIC/82quqKqZrr9HcDeqroM2NvtS5LWyGqmUG4Ednfbu4GbVl+OJGlQgwZ4Af+QZF+S7V3bxqo60m0/BWxc6sIk25PMJpmdm5tbZbmSpHmDPoXy6qo6nOTFwANJvr3wYFVVkiVXxaqqXcAugJmZGVfOkqQhGWgEXlWHu69HgS8AVwFPJ9kE0H09OqoiJUmn6hvgSV6Q5Ofmt4HXA48Ae4Ct3WlbgXtGVaQk6VSDTKFsBL6QZP78v62qLyb5KvCZJNuA7wFvHV2ZkqTF+gZ4VT0JvGKJ9n8HrhtFUZJG79jxk2xYP7Xm12p41vSj9JImx4b1U0zvuG9F1x7cecOQq9FK+FF6SWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywKUxO3b85FiuVftcC0UaM9ck0Uo5ApekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElq1MABnmQqydeT3Nvtb0nyUJIDSe5K8rzRlSlJWuxMRuC3AvsX7H8Q+HBVvQR4Btg2zMIkSac3UIAn2QzcAHyi2w9wLXB3d8pu4KZRFChJWtqgI/CPAO8FftLtvwh4tqpOdPuHgEuWujDJ9iSzSWbn5uZWVaykyXDs+MmxXq+edf1OSPJm4GhV7UvymjN9g6raBewCmJmZqTOuUNLE2bB+iukd9634+oM7bxhiNeeuvgEOvAp4S5I3ARuAnwc+CpyfZF03Ct8MHB5dmZKkxfpOoVTVbVW1uaqmgVuAL1XV24EHgZu707YC94ysSknSKVbzHPgfAX+Y5AC9OfHbh1OSJC1vNfPnZ9vc+yBTKM+pqi8DX+62nwSuGn5JkrS81cy/n21z734SU5IaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJe05s6257HH5YyeA5ekYfBZ7uFwBC5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1yg
CXpEYZ4JLUKANckhplgEtSowxwSWqUAS5Jjeob4Ek2JPlKkm8meTTJB7r2LUkeSnIgyV1Jnjf6ciVJ8wYZgf8PcG1VvQK4Arg+ydXAB4EPV9VLgGeAbaMrU5K0WN8Ar57/7HbXd68CrgXu7tp3AzeNpEJJ0pIGmgNPMpXkG8BR4AHgO8CzVXWiO+UQcMky125PMptkdm5ubhg1S5IYMMCr6mRVXQFsBq4CXjboG1TVrqqaqaqZiy++eIVlSpIWO6OnUKrqWeBB4Brg/CTrukObgcNDrk2SdBqDPIVycZLzu+2fBV4H7KcX5Dd3p20F7hlVkZKkU63rfwqbgN1JpugF/meq6t4kjwGfTvInwNeB20dYpyRpkb4BXlUPA1cu0f4kvflwSdIY+ElMSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4pHPGseMnx3LtqAyyGqEknRU2rJ9iesd9K7r24M4bhlzN6jkCl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWASw2bxAWWtHZczEpq2Nm2OJPOjCNwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIa1TfAk1ya5MEkjyV5NMmtXfuFSR5I8kT39YLRlytJmjfICPwE8J6quhy4GnhnksuBHcDeqroM2NvtS5LWSN8Ar6ojVfW1bvvHwH7gEuBGYHd32m7gplEVKUk61RnNgSeZBq4EHgI2VtWR7tBTwMZlrtmeZDbJ7Nzc3CpK1blgNWt7uC6IzjUDr4WS5IXA54B3V9WPkjx3rKoqSS11XVXtAnYBzMzMLHmONM+1PaTBDTQCT7KeXnjfWVWf75qfTrKpO74JODqaEiVJSxnkKZQAtwP7q+pDCw7tAbZ221uBe4ZfniRpOYNMobwKeAfwrSTf6NreB+wEPpNkG/A94K2jKVGStJS+AV5V/wxkmcPXDbccSZpMx46fZMP6qTW/9nT8hw6SNIBJvMHuR+klqVEGuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwCRfRUpt8DlxiMp/xlfpxBC5JjTLAJalRBri0Ss6Ba1ycA5dWaTXz5+AculbOEbgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckhrVN8CTfDLJ0SSPLGi7MMkDSZ7ovl4w2jIlSYsNMgK/A7h+UdsOYG9VXQbs7fYlSWuob4BX1T8B/7Go+UZgd7e9G7hpyHVJkvpY6Rz4xqo60m0/BWwcUj2SpAGt+iZmVRVQyx1Psj3JbJLZubm51b6dtKxjx0+OuwRpTa1b4XVPJ9lUVUeSbAKOLndiVe0CdgHMzMwsG/TSam1YP8X0jvtWdO3BnTcMuRpp9FY6At8DbO22twL3DKccSdKgBnmM8FPAvwAvTXIoyTZgJ/C6JE8Ar+32JUlrqO8USlW9bZlD1w25Fg3RseMn2bB+as2vHcb1kgaz0jlwTbhxzgc7Fy2tDT9KL0mNMsAlqVEGuE7h89RSG5wD1ylWM4cNzmNLa8URuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDvI/VLuzkwlCSRsXFrPpwYSdJk8oRuCQ1ygCXpEYZ4BPM+XNJp+Mc+ATznwNLOh1H4JLUKANckhplgEtSowxwSWqUAS5JjTLAJalRBviI+Sy3pFHxOfAR81luSaPiCFySGmWAS1KjDHBJalQzAb6am4HeSJR0NlrVTcwk1wMfBaaAT1TVzqFUtYTV3gz0RqKks82KR+BJpoCPAW8ELgfeluTyYRUmSTq91UyhXAUcqKonq+p/gU8DNw6nLElSP6mqlV2Y3AxcX1W/1+2/A/iNqn
rXovO2A9u73ZcCj6+83JG6CPjBuIsYkLWOhrWOTkv1TmKtv1xVFy9uHPkHeapqF7Br1O+zWklmq2pm3HUMwlpHw1pHp6V6W6p1NVMoh4FLF+xv7tokSWtgNQH+VeCyJFuSPA+4BdgznLIkSf2seAqlqk4keRdwP73HCD9ZVY8OrbK1N/HTPAtY62hY6+i0VG8zta74JqYkabya+SSmJOmnGeCS1KhzLsCTXJrkwSSPJXk0ya1d+4VJHkjyRPf1gnHXOi/JVJKvJ7m329+S5KEkB5Lc1d1EnghJzk9yd5JvJ9mf5JpJ7dskf9B9DzyS5FNJNkxK3yb5ZJKjSR5Z0LZkP6bnz7uaH07yygmo9U+774GHk3whyfkLjt3W1fp4kjesZa3L1bvg2HuSVJKLuv2x9m0/51yAAyeA91TV5cDVwDu7JQB2AHur6jJgb7c/KW4F9i/Y/yDw4ap6CfAMsG0sVS3to8AXq+plwCvo1T1xfZvkEuD3gZmqejm9G/G3MDl9ewdw/aK25frxjcBl3Ws78PE1qnHeHZxa6wPAy6vq14F/BW4D6H7WbgF+rbvmL7plOdbSHZxaL0kuBV4P/NuC5nH37elV1Tn9Au4BXkfvE6KburZNwOPjrq2rZTO9H9ZrgXuB0PuU2Lru+DXA/eOus6vlF4Dv0t0cX9A+cX0LXAJ8H7iQ3tNY9wJvmKS+BaaBR/r1I/BXwNuWOm9ctS469lvAnd32bcBtC47dD1wz7r7t2u6mN+g4CFw0KX17ute5OAJ/TpJp4ErgIWBjVR3pDj0FbBxTWYt9BHgv8JNu/0XAs1V1ots/RC+MJsEWYA74627K5xNJXsAE9m1VHQb+jN5o6wjwQ2Afk9u3sHw/zv8ymjdpdf8u8Pfd9kTWmuRG4HBVfXPRoYmsd945G+BJXgh8Dnh3Vf1o4bHq/aod+/OVSd4MHK2qfeOuZUDrgFcCH6+qK4H/YtF0yQT17QX0Fl/bAvwS8AKW+LN6Uk1KP/aT5P30pi3vHHcty0nyfOB9wB+Pu5YzdU4GeJL19ML7zqr6fNf8dJJN3fFNwNFx1bfAq4C3JDlIb7XHa+nNMZ+fZP5DWJO0hMEh4FBVPdTt300v0Cexb18LfLeq5qrqOPB5ev09qX0Ly/fjRC5rkeR3gDcDb+9+4cBk1vqr9H6Rf7P7WdsMfC3JLzKZ9T7nnAvwJAFuB/ZX1YcWHNoDbO22t9KbGx+rqrqtqjZX1TS9Gz9fqqq3Aw8CN3enTUStAFX1FPD9JC/tmq4DHmMC+5be1MnVSZ7ffU/M1zqRfdtZrh/3AL/dPTFxNfDDBVMtY5HeP3t5L/CWqvrvBYf2ALckOS/JFno3B78yjhrnVdW3qurFVTXd/awdAl7ZfT9PXN/+lHFPwq/1C3g1vT89Hwa+0b3eRG9ueS/wBPCPwIXjrnVR3a8B7u22f4XeN/0B4LPAeeOub0GdVwCzXf/+HXDBpPYt8AHg28AjwN8A501K3wKfojc3f5xeoGxbrh/p3dj+GPAd4Fv0nqwZd60H6M0dz/+M/eWC89/f1fo48MZJ6NtFxw/y/zcxx9q3/V5+lF6SGnXOTaFI0tnCAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmN+j8n4/oUC5D7FAAAAABJRU5ErkJggg==\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "code", "source": [ "def KMeanClustering(arr,K,eps):\n", "\n", " # No of clusters are equivalent to \n", " # Initialize random Centroids \n", " \n", " n = len(arr)\n", " random_centroids = random.sample(range(1, n), K)\n", " centroid_val = 
arr[random_centroids,:]\n",
" # centroids_lst and clusters_lst record the state at every iteration\n",
" centroids_lst=[]\n",
" centroids_lst.append(centroid_val)\n",
" clusters_lst = []\n",
" diff = 9999\n",
" j = 0\n",
" while diff > eps:\n",
"\n",
"     ###########################################################\n",
"     # 1. Compute the Euclidean distance between each centroid\n",
"     #    and every observation.\n",
"     # 2. Assign each observation to the centroid at the least distance.\n",
"     ###########################################################\n",
"\n",
"     euclidean_centroid_dist = np.sqrt(np.sum(np.square(arr[:,np.newaxis,:] - centroid_val),axis=2))\n",
"     assigned_cluster = np.argmin(euclidean_centroid_dist,axis=1).reshape(n,1)\n",
"     clusters_lst.append(assigned_cluster)\n",
"     ###########################################################\n",
"     # 3. Compute the new centroids from the cluster assignment\n",
"     #    stored in assigned_cluster.\n",
"     ###########################################################\n",
"\n",
"     centroid_val_old = centroid_val\n",
"     centroid_val = np.zeros([K,arr.shape[1]])\n",
"\n",
"     for i in range(0,K):\n",
"\n",
"         cluster = np.where(assigned_cluster==i)[0]\n",
"         cluster_arr = arr[cluster,:]\n",
"         if len(cluster) > 0:\n",
"             centroid_val[i,:] = np.mean(cluster_arr,axis=0)\n",
"         else:\n",
"             # An empty cluster keeps its previous centroid instead of a NaN mean\n",
"             centroid_val[i,:] = centroid_val_old[i,:]\n",
"\n",
"     # Record the centroids of this iteration\n",
"     centroids_lst.append(centroid_val)\n",
"\n",
"     ###########################################################\n",
"     # 4. Exit condition of the while loop\n",
"     #    - Calculate the difference between the new and the previous\n",
"     #      centroids; if it is below the given eps, end the loop and\n",
"     #      return the clusters and the cluster centroids.\n",
"     #    - Otherwise, keep executing the while loop.\n",
"     ###########################################################\n",
"\n",
"     diff = (1/n)*np.sum(np.square(centroid_val_old - centroid_val))\n",
"     j+=1\n",
"\n",
" return assigned_cluster,centroid_val,clusters_lst,centroids_lst,j\n"
], "metadata": { "id": "yybuH8I_2GIh" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "arr = np.array(df)\n", "K = 3\n", "eps = 1e-5\n", "\n", "clusters,centroids,clusters_lst,centroids_lst,iter = KMeanClustering(arr,K,eps)\n" ], "metadata": { "id": "z3f9Ld8m4GBL" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "print(arr.shape)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Mn7XdrC8odWQ", "outputId": "2bfe995c-8d23-46b9-f7e6-e3f22cdea938" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "(381, 2)\n" ] } ] }, { "cell_type": "code", "source": [
"# clusters has shape (n, 1); flatten it so that each row index is returned once\n",
"index0 = np.where(clusters.ravel() == 0)[0]\n",
"print(len(index0))\n",
"cluster0 = arr[index0,:]\n",
"print(cluster0.shape)\n",
"index1 = np.where(clusters.ravel() == 1)[0]\n",
"cluster1 = arr[index1,:]\n",
"\n",
"index2 = np.where(clusters.ravel() == 2)[0]\n",
"cluster2 = arr[index2,:]"
], "metadata": { "id": "_91yH86BoELU", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "0f4d0eef-2652-4287-f7ca-cec871d844a9" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "86\n", "(86, 2)\n" ] } ] }, { "cell_type": "code", "source": [
"plt.scatter(cluster0[:,0],cluster0[:,1],color='red',alpha=0.5,label='Cluster-1')\n",
"plt.scatter(cluster1[:,0],cluster1[:,1],color='green',alpha=0.5,label='Cluster-2')\n",
"plt.scatter(cluster2[:,0],cluster2[:,1],color='blue',alpha=0.5,label='Cluster-3')\n",
"plt.xlabel('ApplicantIncome')\n",
"plt.ylabel('LoanAmount')\n",
"plt.title(\"Outcome of K-Means clustering algorithm\")\n",
"plt.legend()\n",
"plt.grid()\n",
"plt.show()"
], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 281 }, "id": "BG6w6yMBgFCS", "outputId": "aca316d8-749f-41b6-ec91-bb81ff28733d" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "<Figure size 432x288 with 1 Axes>" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [
"# Hierarchical Clustering\n",
"\n",
"1. Begin with $n$ observations and a measure (such as Euclidean distance) of all the ${n \\choose 2} = n(n-1)/2$ pairwise dissimilarities. Treat each observation as its own cluster.\n",
"\n",
"2. For $i = n,n-1,\\ldots, 2$:\n",
"\n",
"    (a) Examine all pairwise inter-cluster dissimilarities among the $i$ clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed. \n",
"\n",
"    (b) Compute the new pairwise inter-cluster dissimilarities among the $i-1$ remaining clusters.\n",
"\n",
"\n",
"---------------------------------\n",
"\n",
"The algorithm above is simple enough, but one issue has not been addressed. \n",
"\n",
"- We have a concept of the dissimilarity between pairs of observations, but how do we define the dissimilarity between two clusters if one or both of the clusters contain multiple observations?\n",
"\n",
"- The concept of dissimilarity between a pair of observations needs to be extended to a pair of *groups of observations*. This extension is achieved by developing the notion of *linkage*, which defines the dissimilarity between two groups of observations. 
\n", "\n",
"- The four most common types of linkage are *complete*, *average*, *single*, and *centroid*.\n",
"\n",
"For our implementation, we will consider **complete linkage**.\n",
"\n",
"**Complete Linkage**: Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster $A$ and the observations in cluster $B$, and record the largest of these dissimilarities. \n"
], "metadata": { "id": "31SiPB9UNXTo" } }, { "cell_type": "markdown", "source": [
"### Example : Agglomerative Hierarchical Clustering\n",
"\n",
"[Source](https://online.stat.psu.edu/stat555/node/86/)\n",
"\n",
"Clustering starts by computing a distance between every pair of units that you want to cluster. A distance matrix will be symmetric. The table below is an example of a distance matrix. Only the lower triangle is shown, because the upper triangle can be filled in by reflection.\n",
"\n",
"\\begin{align}\n",
"\\begin{array}{|c|ccccc|} \\hline\n",
" & 1 & 2 & 3 & 4 & 5\\\\ \\hline\n",
"1 & 0 & & & & \\\\\n",
"2 & 9 & 0 & & & \\\\\n",
"3 & 3 & 7 & 0 & & \\\\ \n",
"4 & 6 & 5 & 9 & 0 & \\\\\n",
"5 & 11 & 10 & 2 & 8 & 0\\\\ \\hline\n",
"\\end{array}\n",
"\\end{align}\n",
"\n",
"- Now let's start clustering. The smallest distance (2) is between items 3 and 5, so they are merged first into a cluster '35'.\n",
"\n",
"- To obtain the new distance matrix, we need to remove the 3 and 5 entries and replace them with a single entry '35'.\n",
"- Since we are using complete linkage clustering, the distance between '35' and every other item is the maximum of the distance between this item and 3 and the distance between this item and 5.\n",
"\n",
"    - For example: $d(1,3) = 3,\\ d(1,5)=11 \\Rightarrow D(1,\\text{'35'})=11$\n",
"      This gives us the new distance matrix. 
The items with the smallest distance get clustered next.\n",
"\n",
"\\begin{align}\n",
"\\begin{array}{|c|cccc|} \\hline\n",
" & 35 & 1 & 2 & 4 \\\\ \\hline\n",
"35 & 0 & & & \\\\\n",
"1 & 11 & 0 & & \\\\\n",
"2 & 10 & 9 & 0 & \\\\ \n",
"4 & 9 & 6 & 5 & 0\\\\ \\hline\n",
"\\end{array}\n",
"\\end{align}\n",
"\n",
"Similarly, the smallest remaining distance is 5, between items 2 and 4, so we combine them into '24'.\n",
"\n",
"\\begin{align}\n",
"\\begin{array}{|c|ccc|} \\hline\n",
" & 35 & 24 & 1 \\\\ \\hline\n",
"35 & 0 & & \\\\\n",
"24 & 10 & 0 & \\\\\n",
"1 & 11 & 9 & 0 \\\\ \\hline\n",
"\\end{array}\n",
"\\end{align}\n",
"\n",
"Now the smallest distance is 9, so we combine '24' with 1 into '241'.\n",
"\n",
"\\begin{align}\n",
"\\begin{array}{|c|cc|} \\hline\n",
" & 35 & 241 \\\\ \\hline\n",
"35 & 0 & \\\\\n",
"241 & 11 & 0 \\\\ \\hline\n",
"\\end{array}\n",
"\\end{align}\n",
"\n",
"The results above are summarized below. On this plot, the y-axis shows the distance between the objects at the time they were clustered. This is called the **cluster height**. \n",
"\n",
"\n",
"\n"
], "metadata": { "id": "EV4cCITlduWp" } }, { "cell_type": "markdown", "source": [ "" ], "metadata": { "id": "cpuYVIq_jtnx" } }, { "cell_type": "markdown", "source": [
"**Determining Clusters**\n",
"\n",
"One of the problems with hierarchical clustering is that there is no objective way to say how many clusters there are. \n",
"If we cut the single linkage tree at the point shown, we would say we have two clusters. \n"
], "metadata": { "id": "T09L2-HvjxjY" } }, { "cell_type": "markdown", "source": [ "" ], "metadata": { "id": "DABXLpR6Su5o" } }, { "cell_type": "markdown", "source": [ "However, if we cut the tree lower we might say that there is one cluster and two singletons." 
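, "\n", "\n",
"As a minimal sketch of how the cut height changes the answer, the snippet below applies SciPy to the five-item distance matrix from the worked example earlier in this notebook. Note the assumptions: it uses complete linkage (not the single linkage tree mentioned here), and the cut heights 10 and 4 are chosen purely for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.cluster.hierarchy import linkage, fcluster\n",
"from scipy.spatial.distance import squareform\n",
"\n",
"# Distance matrix of the worked example (items 1..5)\n",
"D = np.array([[0, 9, 3, 6, 11],\n",
"              [9, 0, 7, 5, 10],\n",
"              [3, 7, 0, 9, 2],\n",
"              [6, 5, 9, 0, 8],\n",
"              [11, 10, 2, 8, 0]], dtype=float)\n",
"\n",
"# linkage expects the condensed (upper-triangle) form of the matrix\n",
"Z = linkage(squareform(D), method='complete')\n",
"\n",
"# Illustrative cut heights (assumptions, not taken from the figures)\n",
"labels_high = fcluster(Z, t=10, criterion='distance')\n",
"labels_low = fcluster(Z, t=4, criterion='distance')\n",
"print(len(set(labels_high)), len(set(labels_low)))\n",
"```\n",
"\n",
"For this matrix, the cut at height 10 yields two clusters, {1, 2, 4} and {3, 5}, while the cut at height 4 leaves only the pair {3, 5} merged, with items 1, 2 and 4 as singletons."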
], "metadata": { "id": "LnrdewEJS8mr" } }, { "cell_type": "markdown", "source": [ "" ], "metadata": { "id": "YhT_LApITJQj" } }, { "cell_type": "markdown", "source": [ "\n", "\n" ], "metadata": { "id": "x1dITj_HTKaY" } }, { "cell_type": "code", "source": [ "# to run the code on the data set above and then a little larger data set \n", "\n", "class HierarchicalClustering:\n", "\n", " def __init__(self,arr):\n", " self.arr = arr\n", " self.n = len(self.arr)\n", " self.stagearr = np.zeros([2,self.n])\n", " l = int(np.ceil(self.n/2))\n", " print(\"l\",l)\n", " self.clusters = np.empty(l,dtype=object)\n", " self.clusterCounter = 0\n", " \n", " #def euclideanDist(self)\n", " \n", " def CompleteLinkage(self):\n", "\n", " # call the EuclideanDist Function\n", " # For this example, we are skipping that step\n", "\n", "\n", " self.arr = np.where(self.arr == 0, 100, self.arr)\n", "\n", " for k in range(len(self.arr),2,-1):\n", "\n", " print(\"k\",k)\n", " \n", " \n", " self.pos = np.where(self.arr == np.min(self.arr))[0]\n", "\n", " print(\"self.pos\",self.pos)\n", "\n", " # First calling the cluster function, to save the newly groups \n", " # cluster \n", " \n", " self.storeclusters()\n", "\n", " self.stagearr = self.arr[self.pos,:]\n", "\n", " #2 Delete statement one row and one column\n", " self.arr[self.pos,:] = 100\n", " self.arr[:,self.pos] = 100\n", " print(\"self.arr\",\"\\n\",self.arr)\n", " #self.arr = np.delete(self.arr,self.pos,axis=1)\n", " #self.arr = np.delete(self.arr,self.pos,axis=0)\n", " \n", " # deleting the present cluster indexes from the stage arr\n", " print(\"stagearr\",\"\\n\",self.stagearr)\n", " self.stagearr[:,self.pos] = 100\n", " \n", " \n", " print(\"After placeete\",self.stagearr)\n", " self.arr[self.pos,:] = self.stagearr\n", " self.arr[:,self.pos] = self.stagearr.T\n", " newrow = np.max(self.stagearr,axis=0)\n", " newrow = newrow.reshape(len(self.stagearr[1]),1)\n", "\n", " #self.arr = np.hstack((self.arr, newrow))\n", "\n", " newrow = 
np.append(newrow,100)\n", " #self.arr = np.vstack((self.arr, newrow.T))\n", " \n", "\n", " print(\"self.arr\")\n", " print(self.arr)\n", " \n", " \n", " def storeclusters(self):\n", " \n", " print(\"we are in clusters \")\n", " is_looping = True\n", "\n", " # When no clusters are created yet.\n", " if self.clusterCounter == 0:\n", " self.clusters[0] = self.pos\n", " is_looping = False\n", "\n", " else:\n", " i = 0\n", " while i < self.clusterCounter:\n", " \n", " for j in range(0,2):\n", " print(\"j0\",j)\n", " print(\"i\",i)\n", " \n", " if self.clusters[i] is None:\n", " self.clusters[i] = self.pos\n", " is_looping = False\n", " break\n", " elif any(x in self.pos for x in self.clusters[i]):\n", " self.clusters[i] = np.append(self.clusters[i],self.pos)\n", " print(\"elIf\", self.clusters )\n", " is_looping = False\n", " break\n", " else:\n", " continue\n", "\n", " if is_looping is False:\n", " break \n", " else:\n", " i+=1\n", " \n", " \n", " # Case when none of the existing clusters contain any of the \n", " # 2 new cluster values.\n", " if is_looping is True:\n", " self.clusters[self.clusterCounter+1] = self.pos\n", " \n", "\n", " self.clusterCounter +=1\n", " print(\"Cluster Counter\",self.clusterCounter)\n", " print(\"Cluster\",self.clusters)\n", " return self.clusters\n", "\n" ], "metadata": { "id": "88MG6dsijudS" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "f = HierarchicalClustering(x)\n", "f.CompleteLinkage()\n", "#f.clusters()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 426 }, "id": "51XYPvwTm-YV", "outputId": "05e5c49a-0846-42d4-dda6-6b24b3cdd2bc" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "l 13\n", "k 25\n", "self.pos [9]\n", "we are in clusters \n", "Cluster Counter 1\n", "Cluster [array([9]) None None None None None None None None None None None None]\n" ] }, { "output_type": "error", "ename": "IndexError", "evalue": "ignored", 
"traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m<ipython-input-33-c130cb51fea9>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mHierarchicalClustering\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mCompleteLinkage\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;31m#f.clusters()\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m<ipython-input-29-acaeabce0c99>\u001b[0m in \u001b[0;36mCompleteLinkage\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 40\u001b[0m \u001b[0;31m#2 Delete statement one row and one column\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 41\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marr\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpos\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m100\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 42\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marr\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpos\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m100\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 43\u001b[0m 
\u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"self.arr\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\"\\n\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 44\u001b[0m \u001b[0;31m#self.arr = np.delete(self.arr,self.pos,axis=1)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mIndexError\u001b[0m: index 9 is out of bounds for axis 1 with size 2" ] } ] }, { "cell_type": "code", "source": [ "import random\n", "# Note: random.seed only seeds Python's random module; np.random is not seeded here,\n", "# so the sampled values differ between runs.\n", "random.seed(2)\n", "# 25 observations with 2 features, drawn from a normal distribution (mean 10, sd 5)\n", "x = np.random.normal(10,5,50).reshape(25,2)\n", "# Shift the first feature by +3 and the second by -4\n", "x[0:25,0] = x[0:25,0]+ 3\n", "x[0:25,1] = x[0:25,1]-4\n", "print(x)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ToWQFpFE6RpK", "outputId": "338d15d5-dc4d-4bc1-b0a2-de0ff8217202" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[[ 5.34527179 9.4738194 ]\n", " [12.45726243 9.7374621 ]\n", " [11.94022622 12.54273861]\n", " [18.64258696 12.94332958]\n", " [14.15161936 10.44465552]\n", " [ 6.60958318 3.50734521]\n", " [16.57726073 8.24986018]\n", " [15.59865107 6.19294276]\n", " [ 4.85377854 9.50420034]\n", " [13.48193936 -0.30374516]\n", " [24.95740923 4.15751527]\n", " [23.04742754 14.06541976]\n", " [ 8.23589844 13.4393038 ]\n", " [15.08959773 10.02123544]\n", " [ 7.55952291 11.11464924]\n", " [13.06392298 3.96984672]\n", " [16.07720439 4.52031749]\n", " [10.04113041 8.9965649 ]\n", " [12.05649419 6.48268752]\n", " [17.24744721 15.21122072]\n", " [14.23036583 9.28744381]\n", " [ 5.97560551 1.52259464]\n", " [10.92930899 6.17710445]\n", " [15.94983823 3.40697168]\n", " [15.79299311 1.98262685]]\n" ] } ] }, { "cell_type": "code", "source": [ "import scipy.cluster.hierarchy as sch\n", "# Agglomerative (hierarchical) clustering with Ward linkage, drawn as a dendrogram\n", "dendrogram = sch.dendrogram(sch.linkage(x, method='ward'))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 271 }, 
"id": "rvkctGrj5UzD", "outputId": "47be6da1-a2cd-4daf-8be2-9f1eeb4ddc8b" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "<Figure size 432x288 with 1 Axes>" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "code", "source": [ "import numpy as np\n", "# Symmetric 5x5 matrix with a zero diagonal, e.g. pairwise distances between five observations\n", "g = np.array([[0,9,3,6,11],\n", " [9,0,7,5,10],\n", " [3,7,0,9,2],\n", " [6,5,9,0,8],\n", " [11,10,2,8,0]])\n" ], "metadata": { "id": "LBEd7kFHk-WC" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "f = [3,5]\n", "t = [[7,8],[4,5],[6,0]]\n", "\n", "# True if any element of t[1] also appears in f\n", "any(x in f for x in t[1])\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BV_iy5xEKb4k", "outputId": "e1ed9a8b-4c15-4d42-a006-ae5d6fbb9d29" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "True" ] }, "metadata": {}, "execution_count": 156 } ] }, { "cell_type": "code", "source": [ "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "%cd /content/drive/My\\ Drive/colab_notebooks/machine_learning/data/\n", "df = pd.read_csv(\"Mall_Customers.csv\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3EU9AawtJFiw", "outputId": "61991605-1db2-4f77-ec8f-0488340a4116" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "/content/drive/My Drive/colab_notebooks/machine_learning/data\n" ] } ] }, { "cell_type": "code", "source": [ "# Keep only the income and spending-score columns\n", "df = df.loc[:,['Annual Income (k$)','Spending Score (1-100)']]\n", "df" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 423 }, "id": "sgTTXhl_hH_5", "outputId": "5eac8691-05a5-4087-b4db-9edb6e761825" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", " <div id=\"df-37c99dde-fe0a-4590-a5c6-3186f2915241\">\n", " <div
class=\"colab-df-container\">\n", " <div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Annual Income (k$)</th>\n", " <th>Spending Score (1-100)</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>15</td>\n", " <td>39</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>15</td>\n", " <td>81</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>16</td>\n", " <td>6</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>16</td>\n", " <td>77</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>17</td>\n", " <td>40</td>\n", " </tr>\n", " <tr>\n", " <th>...</th>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>195</th>\n", " <td>120</td>\n", " <td>79</td>\n", " </tr>\n", " <tr>\n", " <th>196</th>\n", " <td>126</td>\n", " <td>28</td>\n", " </tr>\n", " <tr>\n", " <th>197</th>\n", " <td>126</td>\n", " <td>74</td>\n", " </tr>\n", " <tr>\n", " <th>198</th>\n", " <td>137</td>\n", " <td>18</td>\n", " </tr>\n", " <tr>\n", " <th>199</th>\n", " <td>137</td>\n", " <td>83</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>200 rows × 2 columns</p>\n", "</div>\n", " <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-37c99dde-fe0a-4590-a5c6-3186f2915241')\"\n", " title=\"Convert this dataframe to an interactive table.\"\n", " style=\"display:none;\">\n", " \n", " <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", " width=\"24px\">\n", " <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n", " <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 
2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n", " </svg>\n", " </button>\n", " \n", " <style>\n", " .colab-df-container {\n", " display:flex;\n", " flex-wrap:wrap;\n", " gap: 12px;\n", " }\n", "\n", " .colab-df-convert {\n", " background-color: #E8F0FE;\n", " border: none;\n", " border-radius: 50%;\n", " cursor: pointer;\n", " display: none;\n", " fill: #1967D2;\n", " height: 32px;\n", " padding: 0 0 0 0;\n", " width: 32px;\n", " }\n", "\n", " .colab-df-convert:hover {\n", " background-color: #E2EBFA;\n", " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", " fill: #174EA6;\n", " }\n", "\n", " [theme=dark] .colab-df-convert {\n", " background-color: #3B4455;\n", " fill: #D2E3FC;\n", " }\n", "\n", " [theme=dark] .colab-df-convert:hover {\n", " background-color: #434B5C;\n", " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", " fill: #FFFFFF;\n", " }\n", " </style>\n", "\n", " <script>\n", " const buttonEl =\n", " document.querySelector('#df-37c99dde-fe0a-4590-a5c6-3186f2915241 button.colab-df-convert');\n", " buttonEl.style.display =\n", " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", "\n", " async function convertToInteractive(key) {\n", " const element = document.querySelector('#df-37c99dde-fe0a-4590-a5c6-3186f2915241');\n", " const dataTable =\n", " await google.colab.kernel.invokeFunction('convertToInteractive',\n", " [key], {});\n", " if (!dataTable) return;\n", "\n", " const docLinkHtml = 'Like what you see? 
Visit the ' +\n", " '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n", " + ' to learn more about interactive tables.';\n", " element.innerHTML = '';\n", " dataTable['output_type'] = 'display_data';\n", " await google.colab.output.renderOutput(dataTable, element);\n", " const docLink = document.createElement('div');\n", " docLink.innerHTML = docLinkHtml;\n", " element.appendChild(docLink);\n", " }\n", " </script>\n", " </div>\n", " </div>\n", " " ], "text/plain": [ " Annual Income (k$) Spending Score (1-100)\n", "0 15 39\n", "1 15 81\n", "2 16 6\n", "3 16 77\n", "4 17 40\n", ".. ... ...\n", "195 120 79\n", "196 126 28\n", "197 126 74\n", "198 137 18\n", "199 137 83\n", "\n", "[200 rows x 2 columns]" ] }, "metadata": {}, "execution_count": 125 } ] }, { "cell_type": "markdown", "source": [ "" ], "metadata": { "id": "5q3-dBcjkMno" } }, { "cell_type": "code", "source": [ "import numpy as np\n", "\n", "# Three observations (rows), each with three features (columns)\n", "a = np.array([[1,2,3],[4,6,7],[10,11,12]])\n", "print(a)\n", "# Broadcasting (3,3) against (3,1,3) gives a (3,3,3) array: g[i,j,:] = (a[j] - a[i])**2\n", "g = np.square(a - a[:,np.newaxis,:])\n", "print(g)\n", "# Summing over the feature axis gives the squared Euclidean distance between rows i and j\n", "h = np.sum(g,axis=2)\n", "print(h)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4W-qM9CrYsFH", "outputId": "7e8e1465-69fe-41bf-84ea-544a0683bb6c" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[[ 1 2 3]\n", " [ 4 6 7]\n", " [10 11 12]]\n", "[[[ 0 0 0]\n", " [ 9 16 16]\n", " [81 81 81]]\n", "\n", " [[ 9 16 16]\n", " [ 0 0 0]\n", " [36 25 25]]\n", "\n", " [[81 81 81]\n", " [36 25 25]\n", " [ 0 0 0]]]\n", "[[ 0 41 243]\n", " [ 41 0 86]\n", " [243 86 0]]\n" ] } ] } ] }