{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> \n",
    ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Embeddings)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Embeddings) to leverage the power of whylogs and WhyLabs together!*"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Logging Generic Embeddings Data using Reference Distances"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/experimental/embeddings/Embeddings_Distance_Logging.ipynb)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "High dimensional embedding spaces can be difficult to understand because we often rely on our own subjective judgement of clusters in the space. Often, data scientists try to find issues solely by hovering over individual data points and noting trends in which ones feel out of place.\n",
    "\n",
    "In whylogs, you are able to profile embeddings values by comparing them to reference data points. These references can be completely determined by users (helpful when they represent prototypical \"ideal\" representations of a cluster or scenario) but can also be chosen programmatically."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "### Install package extras for whylogs\n",
    "\n",
    "For convenience, we include helper functions to select reference data points for comparing new embedding vectors against. To follow this notebook in full, install the `embeddings` extra (for helper functions) and `viz` extra (for visualizing drift) when installing whylogs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: you may need to restart the kernel to use updated packages.\n",
    "%pip install --upgrade whylogs[embeddings,viz]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MNIST dataset\n",
    "\n",
    "### Downloading from OpenML\n",
    "\n",
    "We'll use the 784-dimensional MNIST dataset as our example. This can be downloaded from OpenML via scikit-learn. Because the download can take a few minutes, we suggest saving the data locally as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pickle\n",
    "from sklearn.datasets import fetch_openml\n",
    "\n",
    "if os.path.exists(\"mnist_784_X_y.pkl\"):\n",
    "    X, y = pickle.load(open(\"mnist_784_X_y.pkl\", 'rb'))\n",
    "else:\n",
    "    X, y = fetch_openml(\"mnist_784\", version=1, return_X_y=True, as_frame=False)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Splitting into training and production datasets\n",
    "\n",
    "Instead of training a model, we'll use the same functionality to split our dataset into an original training dataset and data we'll see in our first day of production."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_prod, y_train, y_prod = train_test_split(X, y, test_size=0.1)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding references\n",
    "\n",
    "We would like to compare incoming embeddings against up to 30 predefined references. These can chosen by the user either manually or algorithmically. Both reference selection algorithms provided are conducted on raw data, but only for the purposes of finding references itself.\n",
    "\n",
    "#### Manual selection\n",
    "\n",
    "If we had prototypical examples of digits that we wanted to compare our incoming data against, we would collect those data points now.\n",
    "\n",
    "#### Algorithmic selection for labeled data\n",
    "\n",
    "If we have labels for our data, selecting the centroids of clusters for each label makes sense. We provide a helper class, `PCACentroidSelector`, that finds the centroids in PCA space before converting back to the raw 784-dimensional space.\n",
    "\n",
    "Let's utilize the labels available in the dataset for determining our references."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from whylogs.experimental.preprocess.embeddings.selectors import PCACentroidsSelector\n",
    "\n",
    "references, labels = PCACentroidsSelector(n_components=20).calculate_references(X_train, y_train)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Algorithmic selection for unlabeled data\n",
    "\n",
    "If we have labels for our data, selecting the centroids of clusters for each label makes sense. We provide a helper class, `PCAKMeansSelector`, that finds the unsupervised centroids in PCA space then converting back to raw space.\n",
    "\n",
    "We'll also calculate these but will elect to use the supervised version for the rest of the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from whylogs.experimental.preprocess.embeddings.selectors import PCAKMeansSelector\n",
    "\n",
    "unsup_references, unsup_labels = PCAKMeansSelector(n_clusters=8, n_components=20).calculate_references(X_train, y_train)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Profiling with whylogs\n",
    "\n",
    "As with other advanced features, we can create a `DeclarativeSchema` to tell whylogs to resolve columns of a certain name to the `EmbeddingMetric` that we want to use.\n",
    "\n",
    "We must pass our references, labels, and preferred distance function (either cosine distance or Euclidean distance) as parameters to `EmbeddingConfig` then log as normal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import whylogs as why\n",
    "from whylogs.core.resolvers import MetricSpec, ResolverSpec\n",
    "from whylogs.core.schema import DeclarativeSchema\n",
    "from whylogs.experimental.extras.embedding_metric import (\n",
    "    DistanceFunction,\n",
    "    EmbeddingConfig,\n",
    "    EmbeddingMetric,\n",
    ")\n",
    "\n",
    "config = EmbeddingConfig(\n",
    "    references=references,\n",
    "    labels=labels,\n",
    "    distance_fn=DistanceFunction.euclidean,\n",
    ")\n",
    "schema = DeclarativeSchema([ResolverSpec(column_name=\"pixel_values\", metrics=[MetricSpec(EmbeddingMetric, config)])])\n",
    "\n",
    "train_profile = why.log(row={\"pixel_values\": X_train}, schema=schema)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's confirm the contents of our profile measures the distribution of embeddings relative to the references we've provided."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_profile_view = train_profile.view()\n",
    "column = train_profile_view.get_column(\"pixel_values\")\n",
    "summary = column.to_summary_dict()\n",
    "for digit in [str(i) for i in range(10)]:\n",
    "    mean = summary[f'embedding/{digit}_distance:distribution/mean']\n",
    "    stddev = summary[f'embedding/{digit}_distance:distribution/stddev']\n",
    "    print(f\"{digit} distance: mean {mean}   stddev {stddev}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Measuring embeddings drift in WhyLabs\n",
    "\n",
    "This distance approach can be really powerful for measuring drift across new batches of embeddings in a programmatic way using drift metrics as well as the WhyLabs Observability Platform.\n",
    "\n",
    "We'll look at a single example where an engineer introduces a change to reduce the amount of unnecessary processing by filtering out images where more than 90% of pixels are zeros. This is a realistic cleaning step that might be added to an ML pipeline, but will have a detrimental impact on our incoming data, especially the 1s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find which digits have more than or equal to 90% missing\n",
    "not_empty_mask = (X_prod == 0).sum(axis=1) <= (0.9 * 784)\n",
    "X_prod = X_prod[not_empty_mask]\n",
    "y_prod = y_prod[not_empty_mask]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Log production digits using the same schema\n",
    "prod_profile_view = why.log(row={\"pixel_values\": X_prod}, schema=schema).profile().view()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at this using inside of the whylogs profile view objects:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_profile_summary = train_profile_view.get_column(\"pixel_values\").to_summary_dict()\n",
    "prod_profile_summary = prod_profile_view.get_column(\"pixel_values\").to_summary_dict()\n",
    "for digit in [str(i) for i in range(10)]:\n",
    "    mean_diff = train_profile_summary[f'embedding/{digit}_distance:distribution/mean'] - prod_profile_summary[f'embedding/{digit}_distance:distribution/mean']\n",
    "    stddev_diff = train_profile_summary[f'embedding/{digit}_distance:distribution/stddev'] - prod_profile_summary[f'embedding/{digit}_distance:distribution/stddev']\n",
    "    print(f\"{digit} distance difference (target-prod): mean {mean_diff}   stddev {stddev_diff}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This particular drift has shown up in the distances to our reference data points as we'd expect. In particular, the 1s seem most affected by our rule.\n",
    "\n",
    "## What's Next?\n",
    "\n",
    "### Upload profiles to WhyLabs for more drift calculations and monitoring\n",
    "\n",
    "See [example notebook](https://whylogs.readthedocs.io/en/stable/examples/integrations/writers/Writing_to_WhyLabs.html) for monitoring your profiles continuously with the WhyLabs Observability Platform.\n",
    "\n",
    "### Exploring other sources of drift\n",
    "\n",
    "Consider comparing this profile to different transformations and subsets of our MNIST dataset: randomly selected subsets of the data, normalized values, missing one or more labels, sorted values, and more.\n",
    "\n",
    "### More example notebooks and documentation\n",
    "\n",
    "Go to the [examples page](https://whylogs.readthedocs.io/en/stable/examples.html) for the complete list of examples!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "5b95c30731e012b4373e323ecbfabd3cd166cfbf0d931d39f7996a9595914c45"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}