{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Monitor_Embeddings)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Monitor_Embeddings) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Monitoring Text Embeddings with the 20 Newsgroups Dataset\n",
"\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/tutorials/Monitoring_Embeddings.ipynb)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we will show how to use whylogs and WhyLabs to monitor text data. We will use the [20 Newsgroups dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) to train a classifier and monitor different aspects of the pipeline: we will monitor high dimensional embeddings, the list of tokens, and also the performance of the classifier itself. We will also inject an anomaly to see how we can detect it with WhyLabs. We will translate an increasingly large portion of documents from English to Spanish and see how the model performance degrades and how the embeddings change.\n",
"\n",
"To monitor the embeddings, we will first calculate a number of meaningful reference points. In this case, that means embeddings that represent each document topic. To do so, we will use the labeled data in our training dataset to calculate the centroids in PCA space for each topic. We will then compare each data point of a given batch to this set of reference embeddings. That way, we can calculate distribution metrics according to the distance of each reference point.\n",
"\n",
"We can also tokenize each document and monitor the list of tokens. This is useful to detect changes in the vocabulary of the dataset, number of tokens per document, and other statistics.\n",
"\n",
"Finally, we will monitor the performance of the classifier. We will log both predictions and labels for each batch, so we can calculate metrics such as accuracy, precision, recall, and F1 score."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## What we'll cover in this tutorial\n",
"\n",
"We will divide this example in two stages: Pre-deployment Stage and Production Stage.\n",
"\n",
"In the __Pre-deployment Stage__ we will:\n",
"- train a classifier\n",
"- calculate the centroids for each topic cluster\n",
"\n",
"In the __Production Stage__ we will:\n",
"- load daily batches of data\n",
"- vectorize the data\n",
"- predict the topic for each document\n",
"- log:\n",
" - embeddings distance to the centroids\n",
" - tokens list for each document\n",
" - predictions and targets\n",
"\n",
"In the Production Stage, we will introduce documents in another language (Spanish) to see how the model behaves, and how we can monitor this with WhyLabs."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing Dependencies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install whylogs scikit-learn==1.0.2"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## ✔️ Setting the Environment Variables\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"# set your org-id here - should be something like \"org-xxxx\"\n",
"print(\"Enter your WhyLabs Org ID\") \n",
"os.environ[\"WHYLABS_DEFAULT_ORG_ID\"] = input()\n",
"\n",
"# set your datased_id (or model_id) here - should be something like \"model-xxxx\"\n",
"print(\"Enter your WhyLabs Dataset ID\")\n",
"os.environ[\"WHYLABS_DEFAULT_DATASET_ID\"] = input()\n",
"\n",
"\n",
"# set your API key here\n",
"print(\"Enter your WhyLabs API key\")\n",
"os.environ[\"WHYLABS_API_KEY\"] = getpass.getpass()\n",
"print(\"Using API Key ID: \", os.environ[\"WHYLABS_API_KEY\"][0:10])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pre-deployment"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training the model"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_20newsgroups\n",
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.pipeline import Pipeline\n",
"import numpy as np\n",
"import pandas as pd\n",
"from whylogs.experimental.preprocess.embeddings.selectors import PCACentroidsSelector\n",
"from sklearn.naive_bayes import MultinomialNB"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We will extract TF-IDF vectors to train our classifier. We will later use the same transform pipeline to generate embeddings in the production stage."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MultinomialNB(alpha=0.01)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"categories = [\n",
" \"alt.atheism\",\n",
" \"soc.religion.christian\",\n",
" \"comp.graphics\",\n",
" \"rec.sport.baseball\",\n",
" \"talk.politics.guns\",\n",
" \"misc.forsale\",\n",
" \"sci.med\",\n",
"]\n",
"\n",
"twenty_train = fetch_20newsgroups(\n",
" subset=\"train\", remove=(\"headers\", \"footers\", \"quotes\"), categories=categories, shuffle=True, random_state=42\n",
")\n",
"\n",
"vectorizer = Pipeline(\n",
" [\n",
" (\"vect\", CountVectorizer()),\n",
" (\"tfidf\", TfidfTransformer()),\n",
" ]\n",
")\n",
"vectors_train = vectorizer.fit_transform(twenty_train.data)\n",
"\n",
"vectors_train = vectors_train.toarray()\n",
"\n",
"clf = MultinomialNB(alpha=0.01)\n",
"clf.fit(vectors_train, twenty_train.target)"
]
},
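  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (a minimal sketch, not part of the original pipeline), we can apply the same `vectorizer` to the 20 Newsgroups test split and score the trained classifier. It reuses the `categories`, `vectorizer`, and `clf` objects defined above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: score the trained classifier on the held-out test split,\n",
    "# reusing the fitted vectorizer and classifier from the cells above.\n",
    "twenty_test = fetch_20newsgroups(\n",
    "    subset=\"test\", remove=(\"headers\", \"footers\", \"quotes\"), categories=categories, shuffle=True, random_state=42\n",
    ")\n",
    "vectors_test = vectorizer.transform(twenty_test.data).toarray()\n",
    "print(\"Test accuracy:\", clf.score(vectors_test, twenty_test.target))"
   ]
  },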
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculating Reference Embeddings\n",
"\n",
"If we have labels for our data, selecting the centroids of clusters for each label makes sense. We provide a helper class, `PCACentroidSelector`, that finds the centroids in PCA space before converting back to the original dimensional space. The number of components should be high enough to capture enough information about the clusters, but not so high that it becomes computationally expensive. In this example, let's use 20 components.\n",
"\n",
"Let's utilize the labels available in the dataset for determining our references."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['atheism', 'graphics', 'forsale', 'baseball', 'med', 'christian', 'guns']\n"
]
}
],
"source": [
"references, labels = PCACentroidsSelector(n_components=20).calculate_references(vectors_train, twenty_train.target)\n",
"ref_labels = [twenty_train.target_names[x].split(\".\")[-1] for x in labels]\n",
"print(ref_labels)"
]
},
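  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small sketch (not part of the original flow, and assuming `references` is returned as a NumPy array), we can inspect the references to confirm there is one centroid per topic, each with the same dimensionality as the TF-IDF training vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quick inspection: one reference embedding per topic, in the original TF-IDF space.\n",
    "print(\"references shape:\", references.shape)\n",
    "print(\"training vectors shape:\", vectors_train.shape)\n",
    "assert references.shape[0] == len(ref_labels)"
   ]
  },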
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Production Stage"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configuring Schema for Embeddings+Tokens+Performance logging\n",
"\n",
"By default, whylogs will calculate standard metrics. For this example, we'll be using specialized metrics such as the `EmbeddingMetrics` and `BagofWordsMetrics`, so we need to create a custom schema. Let's do that:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"import whylogs as why\n",
"from whylogs.core.resolvers import MetricSpec, ResolverSpec\n",
"from whylogs.core.schema import DeclarativeSchema\n",
"from whylogs.experimental.extras.embedding_metric import (\n",
" DistanceFunction,\n",
" EmbeddingConfig,\n",
" EmbeddingMetric,\n",
")\n",
"from whylogs.experimental.extras.nlp_metric import BagOfWordsMetric\n",
"from whylogs.core.resolvers import STANDARD_RESOLVER\n",
"\n",
"\n",
"config = EmbeddingConfig(\n",
" references=references,\n",
" labels=ref_labels,\n",
" distance_fn=DistanceFunction.cosine,\n",
")\n",
"embeddings_resolver = ResolverSpec(column_name=\"news_centroids\", metrics=[MetricSpec(EmbeddingMetric, config)])\n",
"tokens_resolver = ResolverSpec(column_name=\"document_tokens\", metrics=[MetricSpec(BagOfWordsMetric)])\n",
"\n",
"embedding_schema = DeclarativeSchema(STANDARD_RESOLVER+[embeddings_resolver])\n",
"token_schema = DeclarativeSchema(STANDARD_RESOLVER+[tokens_resolver])"
]
},
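  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate how these schemas are used (a minimal sketch; the actual daily-batch logging comes later in this tutorial), we can log a tiny stand-in batch. The `dummy_embeddings` and `dummy_tokens` values below are placeholders, not real production data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only - the production loop below logs real daily batches.\n",
    "dummy_embeddings = references[:2]  # pretend batch of two document vectors\n",
    "dummy_tokens = [\"Hello\", \"Just\", \"one\", \"quick\", \"question\"]  # tokens of one document\n",
    "\n",
    "embeddings_profile = why.log(row={\"news_centroids\": dummy_embeddings}, schema=embedding_schema)\n",
    "tokens_profile = why.log(row={\"document_tokens\": dummy_tokens}, schema=token_schema)\n",
    "\n",
    "# The resulting profiles could then be written to WhyLabs, e.g.:\n",
    "# embeddings_profile.writer(\"whylabs\").write()\n",
    "# tokens_profile.writer(\"whylabs\").write()"
   ]
  },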
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading daily batches"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To speed things up, let's download the production data from a public S3 bucket. That way, we won't have to translate or tokenize the documents ourselves.\n",
"\n",
"The DataFrame below contains 5306 documents - 2653 in English and 2653 in Spanish. The spanish documents were obtained by simply translating the english ones. Documents that have the same `doc_id` refers to the same document in different languages.\n",
"\n",
"The tokenization was done using the `nltk` library."
]
},
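  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the cell below is a rough sketch (not the exact preprocessing used to build the dataset) of how `nltk` can produce token lists like the ones in the `tokens` column. It assumes the `punkt` and `stopwords` resources can be downloaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rough illustration of nltk-based tokenization: word-tokenize each document,\n",
    "# keep alphanumeric tokens, and drop English stopwords.\n",
    "import nltk\n",
    "\n",
    "nltk.download(\"punkt\", quiet=True)\n",
    "nltk.download(\"stopwords\", quiet=True)\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.tokenize import word_tokenize\n",
    "\n",
    "stop_words = set(stopwords.words(\"english\"))\n",
    "\n",
    "def tokenize(doc: str):\n",
    "    return [tok for tok in word_tokenize(doc) if tok.isalnum() and tok.lower() not in stop_words]\n",
    "\n",
    "print(tokenize(\"Hello. Just one quick question about this example.\"))"
   ]
  },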
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | doc | \n", "target | \n", "predicted | \n", "tokens | \n", "language | \n", "batch_id | \n", "doc_id | \n", "
---|---|---|---|---|---|---|---|
0 | \n", "Hello\\n\\n Just one quick question\\n ... | \n", "4 | \n", "4 | \n", "[Hello, Just, one, quick, question, My, father... | \n", "en | \n", "0 | \n", "0.0 | \n", "
1 | \n", "OFFICIAL UNITED NATIONS SOUVENIR FOLDERS\\n\\nEa... | \n", "2 | \n", "2 | \n", "[OFFICIAL, UNITED, NATIONS, SOUVENIR, FOLDERS,... | \n", "en | \n", "0 | \n", "1.0 | \n", "
2 | \n", "I am selling Joe Montana SportsTalk Football 9... | \n", "2 | \n", "2 | \n", "[I, selling, Joe, Montana, SportsTalk, Footbal... | \n", "en | \n", "0 | \n", "2.0 | \n", "
3 | \n", "\\n\\nNonsteroid Proventil is a brand of albute... | \n", "4 | \n", "4 | \n", "[Nonsteroid, Proventil, brand, albuterol, bron... | \n", "en | \n", "0 | \n", "3.0 | \n", "
4 | \n", "Two URGENT requests\\n\\n1 I need the latest upd... | \n", "6 | \n", "6 | \n", "[Two, URGENT, requests, 1, I, need, latest, up... | \n", "en | \n", "0 | \n", "4.0 | \n", "