{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Nearest Neighbors\n",
    "\n",
    "Nearest Neighbors enables the query of the k-nearest neighbors from a set of input samples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model can take array-like objects, either in host as NumPy arrays or in device (as Numba or cuda_array_interface-compliant), as well as cuDF DataFrames as the input. \n",
    "\n",
    "For information on converting your dataset to cuDF format, refer to the cuDF documentation: https://docs.rapids.ai/api/cudf/stable\n",
    "\n",
    "For additional information on cuML's Nearest Neighbors implementation: https://rapidsai.github.io/projects/cuml/en/stable/api.html#nearest-neighbors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudf\n",
    "import numpy as np\n",
    "from cuml.datasets import make_blobs\n",
    "from cuml.neighbors import NearestNeighbors as cuNearestNeighbors\n",
    "from sklearn.neighbors import NearestNeighbors as skNearestNeighbors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 2**17\n",
    "n_features = 40\n",
    "\n",
    "n_query = 2**13\n",
    "n_neighbors = 4\n",
    "random_state = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Data\n",
    "\n",
    "### GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "device_data, _ = make_blobs(n_samples=n_samples,\n",
    "                            n_features=n_features,\n",
    "                            centers=5,\n",
    "                            random_state=random_state)\n",
    "\n",
    "device_data = cudf.DataFrame.from_gpu_matrix(device_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy dataset from GPU memory to host memory.\n",
    "# This is done to later compare CPU and GPU results.\n",
    "host_data = device_data.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn Model\n",
    "\n",
    "## Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "knn_sk = skNearestNeighbors(algorithm=\"brute\",\n",
    "                            n_jobs=-1)\n",
    "knn_sk.fit(host_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "D_sk, I_sk = knn_sk.kneighbors(host_data[:n_query], n_neighbors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## cuML Model\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "knn_cuml = cuNearestNeighbors()\n",
    "knn_cuml.fit(device_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "D_cuml, I_cuml = knn_cuml.kneighbors(device_data[:n_query], n_neighbors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compare Results\n",
    "\n",
    "cuML currently uses FAISS for exact nearest neighbors search, which limits inputs to single-precision. This results in possible round-off errors when floats of different magnitude are added. As a result, it's very likely that the cuML results will not match Sciklearn's nearest neighbors exactly. You can read more in the [FAISS wiki](https://github.com/facebookresearch/faiss/wiki/FAQ#why-do-i-get-weird-results-with-brute-force-search-on-vectors-with-large-components).\n",
    "\n",
    "### Distances"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "passed = np.allclose(D_sk, D_cuml.as_gpu_matrix(), atol=1e-3)\n",
    "print('compare knn: cuml vs sklearn distances %s'%('equal'if passed else 'NOT equal'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Indices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sk_sorted = np.sort(I_sk, axis=1)\n",
    "cuml_sorted = np.sort(I_cuml.as_gpu_matrix(), axis=1)\n",
    "\n",
    "diff = sk_sorted - cuml_sorted\n",
    "\n",
    "passed = (len(diff[diff!=0]) / n_samples) < 1e-9\n",
    "print('compare knn: cuml vs sklearn indexes %s'%('equal'if passed else 'NOT equal'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}