{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [ "s1", "content", "l1" ] }, "source": [ "# DBScan\n", "\n", "## DBScan\n", "\n", "DBSCAN has the ability to capture densely packed data points. It is similar to KNNs with variable parameters.\n", "\n", "### Algorithm:\n", "\n", "1) Choose two parameters, a positive numbers - epsilon and an integer minPoints.\n", "\n", "2) Randomly pick few points from the dataset. If there are greater than minPoints within a radius of epsilon from that point, we consider all of them to be part of a \"cluster\". Note that eucledian distance is used here.\n", "\n", "3) Expand that cluster by checking all of the unchecked new points and seeing if they have greater than minPoints within a radius of epsilon, growing the cluster recursively in this manner.\n", "\n", "4) Eventually, we run out of points to add to the cluster. We then pick a new arbitrary point and repeat the process.\n", "\n", "5) At the end of clustering, we could end up with data points not belonging to any cluster that we call noise. \n", "\n", "The figure shows how 3 points which are lying within the ball of radius $\\epsilon$ are considered neighbors that satisfies the criterion for the point to be considered a member of the cluster. The point is then tagged as visited. Later the ball is moved to each neighbor point and the same member determination continues till all points are visited.\n", "\n", "\n", "\n", "Ref: https://en.wikipedia.org/wiki/DBSCAN\n", "\n", "## DBScan using sklearn\n", "\n", "We will have to specify epsilon and a natural number minPoints. Let us use values\n", "epsilon = 0.09\n", "and minPoints = 5.\n", "\n", "```python\n", "db = DBSCAN(eps=0.06, min_samples=5)\n", "db.fit(X)\n", "labels = db.labels_\n", "```\n", "
\n", "## Exercise:\n", " \n", " - Perform DBScan on the dataset and visualize the clusters" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "tags": [ "s1", "ce", "l1" ] }, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "from sklearn.datasets import make_moons\n", "from sklearn.cluster import DBSCAN\n", "\n", "import numpy as np\n", "import seaborn as sns\n", "import pandas as pd\n", "\n", "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", "plt.rcParams['image.interpolation'] = 'nearest'\n", "plt.rcParams['image.cmap'] = 'gray'\n", "\n", "N_Samples = 1000\n", "D = 2\n", "K = 4\n", "\n", "X, y = make_moons(n_samples = 2*N_Samples, noise=0.05, shuffle = False)\n", "x_vec, y_vec = make_moons(n_samples = 2*N_Samples, noise=0.08, shuffle = False)\n", "x_vec[:,0] += 2.5\n", "y_vec += 2\n", "X = np.concatenate((X, x_vec), axis=0)\n", "y = np.concatenate((y, y_vec), axis=0)\n", "\n", "# Create a dataframe moon_df and visualize a graph g\n", "moon_df = pd.DataFrame({'X_0':X[:,0],'X_1':X[:,1], 'y':y})\n", "g = sns.pairplot(x_vars=\"X_0\", y_vars=\"X_1\", hue=\"y\", data = moon_df)\n", "g.fig.set_size_inches(14, 6)\n", "sns.despine()\n", "\n", "# Initialize a DBScan cluster with eps and minimum points\n", "db = DBSCAN(eps=0.06, min_samples=5)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "s1", "l1", "hint" ] }, "source": [ "

Use .fit and .labels_ for fitting the dbscan and accessing the cluster labels respectively.

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "tags": [ "s1", "l1", "ans" ] }, "outputs": [], "source": [ "db.fit(X)\n", "labels = db.labels_\n", "\n", "# Create a data frame and visualize the plot.\n", "moon_df['db_clus'] = labels\n", "g=sns.pairplot(x_vars=\"X_0\", y_vars=\"X_1\", hue = \"db_clus\", data = moon_df)\n", "g.fig.set_size_inches(14, 6)\n", "sns.despine()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "tags": [ "s1", "hid", "l1" ] }, "outputs": [], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " db_ = DBSCAN(eps=0.06, min_samples=5)\n", " db_.fit(X)\n", " \n", " import numpy as np\n", " \n", " if np.all(moon_df['db_clus'] == db_.labels_):\n", " ref_assert_var = True\n", " out = g\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l2", "content", "s2" ] }, "source": [ "\n", "


\n", "## Clustering Shapes\n", "\n", "DBScan uses density based clustering and though it appears to determine various shapes, not always is it successful. Consider two intertwined circles with a high noise of 0.2. This dataset can be generated by make_circles function in the sklearn.datasets using a noise of 0.2.\n", "\n", "\n", "\n", "
\n", "## Exercise:\n", "\n", " - Use DBScan with eps of 0.5 to cluster the dataset and plot the results.\n", " - Assign the cluster labels to the variable labels." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true, "tags": [ "l2", "ce", "s2" ] }, "outputs": [], "source": [ "from sklearn.datasets import make_circles\n", "\n", "# Generate the dataset with intertwined circles\n", "X, y = make_circles(n_samples=N_Samples, factor=.5, noise=.2)\n", "noisy_circles = pd.DataFrame({'X_0':X[:,0],'X_1':X[:,1], 'y':y})\n", "\n", "# Perform DBScan on the dataset and visualize the results\n", "db = DBSCAN(eps=0.5, min_samples=5)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l2", "s2", "hint" ] }, "source": [ "

Transform the data into a dataframe and plot the result with a hue as the cluster labels.

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "tags": [ "l2", "s2", "ans" ] }, "outputs": [], "source": [ "db.fit(X)\n", "labels = db.labels_\n", "noisy_circles = pd.DataFrame({'X_0':X[:,0], 'X_1':X[:,1], 'y':labels})\n", "g = sns.pairplot(x_vars=\"X_0\", y_vars=\"X_1\", hue = \"y\", data = noisy_circles)\n", "g.fig.set_size_inches(14, 6)\n", "sns.despine()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "tags": [ "l2", "hid", "s2" ] }, "outputs": [], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " import numpy as np\n", " \n", " db_ = DBSCAN(eps=0.5, min_samples=5)\n", " db_.fit(X)\n", " \n", " if np.all(db_.labels_ == labels):\n", " ref_assert_var = True\n", " out = g\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] } ], "metadata": { "executed_sections": [], "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" }, "rf_version": 1 }, "nbformat": 4, "nbformat_minor": 2 }