{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## KNN\n", "K NearestNeighbors is an unsupervised algorithm: to find the “closest” datapoint(s) to new, unseen data, one calculates a suitable “distance” between the query and every datapoint and returns the K datapoints with the smallest distance to it.\n", "\n", "cuML’s KNN expects a cuDF DataFrame or a NumPy array (automatic chunking of NumPy arrays will be done in a future release). It first fits a special data structure to approximate the distance calculations, allowing query times of O(p log n) rather than the brute-force O(np), where n is the number of datapoints and p is the number of features.\n", "\n", "The KNN function accepts the following parameters:\n", "1. n_neighbors: int (default = 5). The top K closest datapoints you want the algorithm to return. If this number is large, expect the algorithm to run slower.\n", "1. should_downcast: bool (default = False). Currently only single precision is supported in the underlying index. Setting this to True allows double-precision input arrays to be automatically downcast to single precision.\n", "\n", "The methods that can be used with KNN are:\n", "1. fit: Fit GPU index for performing nearest neighbor queries.\n", "1. kneighbors: Query the GPU index for the k nearest neighbors of row vectors in X.\n", "\n", "The model accepts array-like objects, either on the host as NumPy arrays or on the device (as Numba arrays or any `__cuda_array_interface__`-compliant object), as well as cuDF DataFrames. To convert your dataset to cuDF format, please read the cuDF documentation at https://rapidsai.github.io/projects/cudf/en/latest/. 
For additional information on the K NearestNeighbors model, please refer to the documentation at https://rapidsai.github.io/projects/cuml/en/latest/api.html#nearest-neighbors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import cudf\n", "import os\n", "\n", "from sklearn.neighbors import NearestNeighbors as skKNN\n", "from cuml.neighbors.nearest_neighbors import NearestNeighbors as cumlKNN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Helper Functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check if the mortgage dataset is present and then extract the data from it, else just create a random dataset\n", "import gzip\n", "# change the path of the mortgage dataset if you have saved it in a different directory\n", "def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz',source='mortgage'):\n", "    if os.path.exists(cached) and source=='mortgage':\n", "        print('use mortgage data')\n", "        with gzip.open(cached) as f:\n", "            X = np.load(f)\n", "        # sample nrows rows with replacement; randint's upper bound is exclusive\n", "        X = X[np.random.randint(0,X.shape[0],nrows),:ncols]\n", "    else:\n", "        # create a random dataset\n", "        print('use random data')\n", "        X = np.random.random((nrows,ncols)).astype('float32')\n", "    df = pd.DataFrame({'fea%d'%i:X[:,i] for i in range(X.shape[1])}).fillna(0)\n", "    return df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error\n", "# this function checks if the results obtained from two different methods (sklearn and cuml) are the same\n", "def array_equal(a,b,threshold=1e-3,with_sign=True,metric='mse'):\n", "    a = to_nparray(a)\n", "    b = to_nparray(b)\n", "    if not with_sign:\n", "        a,b = np.abs(a),np.abs(b)\n", "    if metric=='mse':\n", "        error = mean_squared_error(a,b)\n", "        res = error < threshold\n", "    # ... threshold]) == 0\n", "    elif metric == 'acc':\n", " 
       error = np.sum(a!=b)/(a.shape[0]*a.shape[1])\n", "        res = error < threshold\n", "    return res\n", "\n", "# ... 1]) / (c.shape[0]*c.shape[1])\n", "# return c