{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)\n", "The DBSCAN algorithm is a clustering algorithm that works well for datasets in which samples congregate in large groups. cuML’s DBSCAN expects a cuDF DataFrame and constructs an adjacency graph to compute the distances between close neighbours. The DBSCAN model implemented in the cuML library accepts the following parameters:\n", "1. eps: the maximum distance between two samples for them to be considered neighbors\n", "2. min_samples: the minimum number of samples that should be present in a neighborhood for a point to be considered a core point.\n", "\n", "The methods that can be used with DBSCAN are:\n", "1. fit: Perform DBSCAN clustering from features.\n", "1. fit_predict: Perform DBSCAN clustering from features and return cluster labels.\n", "1. get_params: Scikit-learn-style getter for the parameter state.\n", "1. set_params: Scikit-learn-style setter that sets the parameter state from a dictionary of params.\n", "\n", "The model can take array-like objects, either on host (as NumPy arrays) or on device (as Numba arrays or objects compliant with __cuda_array_interface__), as well as cuDF DataFrames. To convert your dataset to cuDF format, please refer to the cuDF documentation at https://rapidsai.github.io/projects/cudf/en/latest/. 
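
As a quick illustration of how eps and min_samples interact, here is a hypothetical sketch using scikit-learn's DBSCAN, whose interface cuML's DBSCAN mirrors; the data points below are made up for illustration only:

```python
# Illustrative sketch of the eps / min_samples parameters.
# Two tight groups of 3 points each form clusters; the far-away
# point has no neighbors within eps, so it is labeled noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [1.0, 1.1],   # dense group -> cluster 0
              [8.0, 8.0], [8.1, 8.0], [8.0, 8.1],   # dense group -> cluster 1
              [50.0, 50.0]])                        # isolated point -> noise
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels.tolist())  # [0, 0, 0, 1, 1, 1, -1]
```
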
For additional information on the DBSCAN model, please refer to the cuML documentation at https://rapidsai.github.io/projects/cuml/en/latest/index.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.cluster import DBSCAN as skDBSCAN\n", "from cuml import DBSCAN as cumlDBSCAN\n", "import cudf\n", "import os" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Helper Functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check if the mortgage dataset is present and extract the data from it; otherwise create a random dataset for clustering\n", "import gzip\n", "# change the path of the mortgage dataset if you have saved it in a different directory\n", "def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz'):\n", " if os.path.exists(cached):\n", " print('use mortgage data')\n", " with gzip.open(cached) as f:\n", " X = np.load(f)\n", " # sample nrows rows at random and keep the first ncols columns\n", " X = X[np.random.randint(0,X.shape[0],nrows),:ncols]\n", " else:\n", " # create a random dataset\n", " print('use random data')\n", " X = np.random.rand(nrows,ncols)\n", " df = pd.DataFrame({'fea%d'%i:X[:,i] for i in range(X.shape[1])})\n", " return df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this function checks if the results obtained from two different methods are the same\n", "from sklearn.metrics import mean_squared_error\n", "def array_equal(a,b,threshold=5e-3,with_sign=True):\n", " a = to_nparray(a)\n", " b = to_nparray(b)\n", " if not with_sign:\n", " a,b = np.abs(a),np.abs(b)\n", " res = mean_squared_error(a,b)