{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# UMAP Demo with Graphs\n",
    "\n",
    "[UMAP](https://umap-learn.readthedocs.io/en/latest/) is a powerful dimensionality reduction tool which NVIDIA recently ported to GPUs with a python interface.  In this notebook we will demostrate basic usage, plotting, and timing of the unsupervised CUDA (GPU) version of UMAP. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports and Set Up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# libraries for scoring/clustering\n",
    "from sklearn.manifold.t_sne import trustworthiness\n",
    "\n",
    "# GPU UMAP\n",
    "import cudf\n",
    "from cuml.manifold.umap import UMAP as cumlUMAP\n",
    "\n",
    "# plotting\n",
    "import seaborn as sns\n",
    "import matplotlib as mpl\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "sns.set(style='white', rc={'figure.figsize':(25, 12.5)})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# hide warnings\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sanity Checks\n",
    "\n",
    "We are going to work with the [fashion mnist](https://github.com/zalandoresearch/fashion-mnist) data set.  This is a dataset consisting of 70,000 28x28 grayscale images of clothing.  It should already be in the `data/fashion` folder, but let's do a sanity check!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.exists('data/fashion'):\n",
    "    print(\"error, data is missing!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's make sure we have our RAPIDS compliant GPU.  It must be Pascal or higher!  You can also use this to define which GPU RAPIDS should use (advanced feature not covered here)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Helper Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py\n",
    "def load_mnist(path, kind='train'):\n",
    "    import os\n",
    "    import gzip\n",
    "    import numpy as np\n",
    "\n",
    "    \"\"\"Load MNIST data from `path`\"\"\"\n",
    "    labels_path = os.path.join(path,\n",
    "                               '%s-labels-idx1-ubyte.gz'\n",
    "                               % kind)\n",
    "    images_path = os.path.join(path,\n",
    "                               '%s-images-idx3-ubyte.gz'\n",
    "                               % kind)\n",
    "\n",
    "    with gzip.open(labels_path, 'rb') as lbpath:\n",
    "        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,\n",
    "                               offset=8)\n",
    "\n",
    "    with gzip.open(images_path, 'rb') as imgpath:\n",
    "        images = np.frombuffer(imgpath.read(), dtype=np.uint8,\n",
    "                               offset=16).reshape(len(labels), 784)\n",
    "\n",
    "    return images, labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train, train_labels = load_mnist('data/fashion', kind='train')\n",
    "test, test_labels = load_mnist('data/fashion', kind='t10k')\n",
    "data = np.array(np.vstack([train, test]), dtype=np.float64) / 255.0\n",
    "target = np.array(np.hstack([train_labels, test_labels]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 60000 training images and 10000 test images"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f\"Train shape: {train.shape} and Test Shape: {test.shape}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train[0].shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As mentioned previously, each row in the train matrix is an image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# display a Nike? sneaker\n",
    "pixels = train[0].reshape((28, 28))\n",
    "plt.imshow(pixels, cmap='gray')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is cost with moving data between host memory and device memory (GPU memory) and we will include that core when comparing speeds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "record_data = (('fea%d'%i, data[:,i]) for i in range(data.shape[1]))\n",
    "gdf = cudf.DataFrame(record_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`gdf` is a GPU backed dataframe -- all the data is stored in the device memory of the GPU.  With the data converted, we can apply the `cumlUMAP` the same inputs as we do for the standard UMAP.  Additionally, it should be noted that within cuml, [FAISS] https://github.com/facebookresearch/faiss) is used for extremely fast kNN and it's limited to single precision.  `cumlUMAP` will automatically downcast to `float32` when needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "g_embedding = cumlUMAP(n_neighbors=5, init=\"spectral\").fit_transform(gdf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualization\n",
    "\n",
    "OK, now let's plot the output of the embeddings so that we can see the seperation of the neighborhoods.  Let's start by creating the classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "classes = [\n",
    "    'T-shirt/top',\n",
    "    'Trouser',\n",
    "    'Pullover',\n",
    "    'Dress',\n",
    "    'Coat',\n",
    "    'Sandal',\n",
    "    'Shirt',\n",
    "    'Sneaker',\n",
    "    'Bag',\n",
    "    'Ankle boot']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Needs to be redone because of timeit function sometimes loses our g_embedding variable\n",
    "g_embedding = cumlUMAP(n_neighbors=5, init=\"spectral\").fit_transform(gdf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just as the original author of UMAP, Leland McInnes, states in the [UMAP docs](https://umap-learn.readthedocs.io/en/latest/supervised.html), we can plot the results and show the separation between the various classes defined above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g_embedding_numpy = g_embedding.to_pandas().values #it is necessary to convert to numpy array to do the visual mapping\n",
    "\n",
    "fig, ax = plt.subplots(1, figsize=(14, 10))\n",
    "plt.scatter(g_embedding_numpy[:,1], g_embedding_numpy[:,0], s=0.3, c=target, cmap='Spectral', alpha=1.0)\n",
    "plt.setp(ax, xticks=[], yticks=[])\n",
    "cbar = plt.colorbar(boundaries=np.arange(11)-0.5)\n",
    "cbar.set_ticks(np.arange(10))\n",
    "cbar.set_ticklabels(classes)\n",
    "plt.title('Fashion MNIST Embedded via cumlUMAP');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Additionally, we can also quanititaviely compare the perfomance of `cumlUMAP` (GPU UMAP) to the reference/original implementation (CPU UMAP) using the [trustworthiness score](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/manifold/t_sne.py#L395).  From the docstring:\n",
    "\n",
    "> Trustworthiness expresses to what extent the local structure is retained.  The trustworthiness is within [0, 1].\n",
    "\n",
    "\n",
    "Like `t-SNE`, UMAP tries to capture both global and local structure and thus, we can apply the `trustworthiness` of the `g_embedding` data against the original input.  With a higher score we are demonstrating that the algorithm does a better and better job of local structure retention.  As [Corey Nolet](https://github.com/cjnolet) notes:\n",
    "> Algorithms like UMAP aim to preserve local neighborhood structure and so measuring this property (trustworthiness) measures the algorithm's performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scoring ~97% shows the GPU implementation is comparable to the original CPU implementation and the training time was ~9.5X faster"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}