{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)\n", "The DBSCAN algorithm is a clustering algorithm that works well for datasets in which samples congregate in large groups. cuML’s DBSCAN expects a cuDF DataFrame and constructs an adjacency graph to compute the distances between close neighbours. The DBSCAN model implemented in the cuML library accepts the following parameters:\n", "1. eps: the maximum distance between two samples for them to be considered neighbors\n", "2. min_samples: the minimum number of samples that should be present in a neighborhood for a point to be considered a core point.\n", "\n", "The methods that can be used with DBSCAN are:\n", "1. fit: Perform DBSCAN clustering from features.\n", "1. fit_predict: Perform DBSCAN clustering from features and return cluster labels.\n", "1. get_params: Scikit-learn-style getter for the parameter state.\n", "1. set_params: Scikit-learn-style setter that sets the parameter state from a dictionary of params.\n", "\n", "The model can take array-like objects, either on host (as NumPy arrays) or on device (as Numba arrays or objects compliant with __cuda_array_interface__), as well as cuDF DataFrames. To convert your dataset to cuDF format, please refer to the cuDF documentation at https://rapidsai.github.io/projects/cudf/en/latest/. 
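
As a quick illustration of how eps and min_samples interact, here is a hypothetical sketch using scikit-learn's DBSCAN, whose interface cuML's DBSCAN mirrors; the data points below are made up for illustration only:

```python
# Illustrative sketch of the eps / min_samples parameters.
# Two tight groups of 3 points each form clusters; the far-away
# point has no neighbors within eps, so it is labeled noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [1.0, 1.1],   # dense group -> cluster 0
              [8.0, 8.0], [8.1, 8.0], [8.0, 8.1],   # dense group -> cluster 1
              [50.0, 50.0]])                        # isolated point -> noise
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels.tolist())  # [0, 0, 0, 1, 1, 1, -1]
```
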
For additional information on the DBSCAN model, please refer to the cuML documentation at https://rapidsai.github.io/projects/cuml/en/latest/index.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.cluster import DBSCAN as skDBSCAN\n", "from cuml import DBSCAN as cumlDBSCAN\n", "import cudf\n", "import os" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Helper Functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check if the mortgage dataset is present and extract the data from it; otherwise create a random dataset for clustering\n", "import gzip\n", "# change the path of the mortgage dataset if you have saved it in a different directory\n", "def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz'):\n", " if os.path.exists(cached):\n", " print('use mortgage data')\n", " with gzip.open(cached) as f:\n", " X = np.load(f)\n", " # sample nrows rows at random and keep the first ncols columns\n", " X = X[np.random.randint(0,X.shape[0],nrows),:ncols]\n", " else:\n", " # create a random dataset\n", " print('use random data')\n", " X = np.random.rand(nrows,ncols)\n", " df = pd.DataFrame({'fea%d'%i:X[:,i] for i in range(X.shape[1])})\n", " return df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this function checks if the results obtained from two different methods are the same\n", "from sklearn.metrics import mean_squared_error\n", "def array_equal(a,b,threshold=5e-3,with_sign=True):\n", " a = to_nparray(a)\n", " b = to_nparray(b)\n", " if not with_sign:\n", " a,b = np.abs(a),np.abs(b)\n", " res = mean_squared_error(a,b)