{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Nearest Neighbors\n", "\n", "Nearest Neighbors enables the query of the k-nearest neighbors from a set of input samples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model can take array-like objects, either in host as NumPy arrays or in device (as Numba or cuda_array_interface-compliant), as well as cuDF DataFrames as the input. \n", "\n", "For information on converting your dataset to cuDF format, refer to the cuDF documentation: https://docs.rapids.ai/api/cudf/stable\n", "\n", "For additional information on cuML's Nearest Neighbors implementation: https://rapidsai.github.io/projects/cuml/en/stable/api.html#nearest-neighbors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cudf\n", "import numpy as np\n", "from cuml.datasets import make_blobs\n", "from cuml.neighbors import NearestNeighbors as cuNearestNeighbors\n", "from sklearn.neighbors import NearestNeighbors as skNearestNeighbors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_samples = 2**17\n", "n_features = 40\n", "\n", "n_query = 2**13\n", "n_neighbors = 4\n", "random_state = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate Data\n", "\n", "### GPU" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "device_data, _ = make_blobs(n_samples=n_samples,\n", " n_features=n_features,\n", " centers=5,\n", " random_state=random_state)\n", "\n", "device_data = cudf.DataFrame.from_gpu_matrix(device_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Copy dataset from GPU memory to host memory.\n", "# This is done to later compare CPU and GPU results.\n", "host_data = device_data.to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn Model\n", "\n", "## Fit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "knn_sk = skNearestNeighbors(algorithm=\"brute\",\n", " n_jobs=-1)\n", "knn_sk.fit(host_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "D_sk, I_sk = knn_sk.kneighbors(host_data[:n_query], n_neighbors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## cuML Model\n", "\n", "### Fit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "knn_cuml = cuNearestNeighbors()\n", "knn_cuml.fit(device_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "D_cuml, I_cuml = knn_cuml.kneighbors(device_data[:n_query], n_neighbors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compare Results\n", "\n", "cuML currently uses FAISS for exact nearest neighbors search, which limits inputs to single-precision. This results in possible round-off errors when floats of different magnitude are added. As a result, it's very likely that the cuML results will not match Sciklearn's nearest neighbors exactly. You can read more in the [FAISS wiki](https://github.com/facebookresearch/faiss/wiki/FAQ#why-do-i-get-weird-results-with-brute-force-search-on-vectors-with-large-components).\n", "\n", "### Distances" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "passed = np.allclose(D_sk, D_cuml.as_gpu_matrix(), atol=1e-3)\n", "print('compare knn: cuml vs sklearn distances %s'%('equal'if passed else 'NOT equal'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indices" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sk_sorted = np.sort(I_sk, axis=1)\n", "cuml_sorted = np.sort(I_cuml.as_gpu_matrix(), axis=1)\n", "\n", "diff = sk_sorted - cuml_sorted\n", "\n", "passed = (len(diff[diff!=0]) / n_samples) < 1e-9\n", "print('compare knn: cuml vs sklearn indexes %s'%('equal'if passed else 'NOT equal'))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 4 }