{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Monitor_Embeddings)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Monitor_Embeddings) to leverage the power of whylogs and WhyLabs together!*" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Monitoring Text Embeddings with the 20 Newsgroups Dataset\n", "\n", "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/tutorials/Monitoring_Embeddings.ipynb)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we will show how to use whylogs and WhyLabs to monitor text data. We will use the [20 Newsgroups dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) to train a classifier and monitor different aspects of the pipeline: we will monitor high dimensional embeddings, the list of tokens, and also the performance of the classifier itself. We will also inject an anomaly to see how we can detect it with WhyLabs. We will translate an increasingly large portion of documents from English to Spanish and see how the model performance degrades and how the embeddings change.\n", "\n", "To monitor the embeddings, we will first calculate a number of meaningful reference points. In this case, that means embeddings that represent each document topic. To do so, we will use the labeled data in our training dataset to calculate the centroids in PCA space for each topic. We will then compare each data point of a given batch to this set of reference embeddings. 
That way, we can calculate distribution metrics based on the distance to each reference point.\n", "\n", "We can also tokenize each document and monitor the list of tokens. This is useful to detect changes in the vocabulary of the dataset, the number of tokens per document, and other statistics.\n", "\n", "Finally, we will monitor the performance of the classifier. We will log both predictions and labels for each batch, so we can calculate metrics such as accuracy, precision, recall, and F1 score." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## What we'll cover in this tutorial\n", "\n", "We will divide this example into two stages: the Pre-deployment Stage and the Production Stage.\n", "\n", "In the __Pre-deployment Stage__ we will:\n", "- train a classifier\n", "- calculate the centroids for each topic cluster\n", "\n", "In the __Production Stage__ we will:\n", "- load daily batches of data\n", "- vectorize the data\n", "- predict the topic for each document\n", "- log:\n", "    - embeddings distance to the centroids\n", "    - tokens list for each document\n", "    - predictions and targets\n", "\n", "In the Production Stage, we will introduce documents in another language (Spanish) to see how the model behaves, and how we can monitor this with WhyLabs." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Installing Dependencies" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Note: you may need to restart the kernel to use updated packages.\n", "%pip install whylogs scikit-learn==1.0.2" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## ✔️ Setting the Environment Variables\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import getpass\n", "import os\n", "\n", "# set your org-id here - should be something like \"org-xxxx\"\n", "print(\"Enter your WhyLabs Org ID\")\n", "os.environ[\"WHYLABS_DEFAULT_ORG_ID\"] = input()\n", "\n", "# set your dataset_id (or model_id) here - should be something like \"model-xxxx\"\n", "print(\"Enter your WhyLabs Dataset ID\")\n", "os.environ[\"WHYLABS_DEFAULT_DATASET_ID\"] = input()\n", "\n", "\n", "# set your API key here\n", "print(\"Enter your WhyLabs API key\")\n", "os.environ[\"WHYLABS_API_KEY\"] = getpass.getpass()\n", "print(\"Using API Key ID: \", os.environ[\"WHYLABS_API_KEY\"][0:10])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Pre-deployment" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Training the model" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.pipeline import Pipeline\n", "import numpy as np\n", "import pandas as pd\n", "from whylogs.experimental.preprocess.embeddings.selectors import PCACentroidsSelector\n", "from sklearn.naive_bayes import MultinomialNB" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We will extract TF-IDF vectors to train our classifier. 
We will later use the same transform pipeline to generate embeddings in the production stage." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultinomialNB(alpha=0.01)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categories = [\n", "    \"alt.atheism\",\n", "    \"soc.religion.christian\",\n", "    \"comp.graphics\",\n", "    \"rec.sport.baseball\",\n", "    \"talk.politics.guns\",\n", "    \"misc.forsale\",\n", "    \"sci.med\",\n", "]\n", "\n", "twenty_train = fetch_20newsgroups(\n", "    subset=\"train\", remove=(\"headers\", \"footers\", \"quotes\"), categories=categories, shuffle=True, random_state=42\n", ")\n", "\n", "vectorizer = Pipeline(\n", "    [\n", "        (\"vect\", CountVectorizer()),\n", "        (\"tfidf\", TfidfTransformer()),\n", "    ]\n", ")\n", "vectors_train = vectorizer.fit_transform(twenty_train.data)\n", "\n", "vectors_train = vectors_train.toarray()\n", "\n", "clf = MultinomialNB(alpha=0.01)\n", "clf.fit(vectors_train, twenty_train.target)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating Reference Embeddings\n", "\n", "If we have labels for our data, selecting the centroids of clusters for each label makes sense. We provide a helper class, `PCACentroidsSelector`, that finds the centroids in PCA space before converting back to the original dimensional space. The number of components should be high enough to capture enough information about the clusters, but not so high that it becomes computationally expensive. In this example, let's use 20 components.\n", "\n", "Let's utilize the labels available in the dataset for determining our references." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['atheism', 'graphics', 'forsale', 'baseball', 'med', 'christian', 'guns']\n" ] } ], "source": [ "references, labels = PCACentroidsSelector(n_components=20).calculate_references(vectors_train, twenty_train.target)\n", "ref_labels = [twenty_train.target_names[x].split(\".\")[-1] for x in labels]\n", "print(ref_labels)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Production Stage" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Configuring Schema for Embeddings+Tokens+Performance logging\n", "\n", "By default, whylogs will calculate standard metrics. For this example, we'll be using specialized metrics such as `EmbeddingMetric` and `BagOfWordsMetric`, so we need to create a custom schema. Let's do that:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import whylogs as why\n", "from whylogs.core.resolvers import STANDARD_RESOLVER, MetricSpec, ResolverSpec\n", "from whylogs.core.schema import DeclarativeSchema\n", "from whylogs.experimental.extras.embedding_metric import (\n", "    DistanceFunction,\n", "    EmbeddingConfig,\n", "    EmbeddingMetric,\n", ")\n", "from whylogs.experimental.extras.nlp_metric import BagOfWordsMetric\n", "\n", "\n", "config = EmbeddingConfig(\n", "    references=references,\n", "    labels=ref_labels,\n", "    distance_fn=DistanceFunction.cosine,\n", ")\n", "embeddings_resolver = ResolverSpec(column_name=\"news_centroids\", metrics=[MetricSpec(EmbeddingMetric, config)])\n", "tokens_resolver = ResolverSpec(column_name=\"document_tokens\", metrics=[MetricSpec(BagOfWordsMetric)])\n", "\n", "embedding_schema = DeclarativeSchema(STANDARD_RESOLVER + [embeddings_resolver])\n", "token_schema = DeclarativeSchema(STANDARD_RESOLVER + [tokens_resolver])" ] }, { "attachments": 
{}, "cell_type": "markdown", "metadata": {}, "source": [ "### Loading daily batches" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To speed things up, let's download the production data from a public S3 bucket. That way, we won't have to translate or tokenize the documents ourselves.\n", "\n", "The DataFrame below contains 5306 documents - 2653 in English and 2653 in Spanish. The Spanish documents were obtained by simply translating the English ones. Rows that share the same `doc_id` refer to the same document in different languages.\n", "\n", "The tokenization was done using the `nltk` library." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\">\n",
 "      <th></th>\n",
 "      <th>doc</th>\n",
 "      <th>target</th>\n",
 "      <th>predicted</th>\n",
 "      <th>tokens</th>\n",
 "      <th>language</th>\n",
 "      <th>batch_id</th>\n",
 "      <th>doc_id</th>\n",
 "    </tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr>\n",
 "      <th>0</th>\n",
 "      <td>Hello\\n\\n Just one quick question\\n ...</td>\n",
 "      <td>4</td>\n",
 "      <td>4</td>\n",
 "      <td>[Hello, Just, one, quick, question, My, father...</td>\n",
 "      <td>en</td>\n",
 "      <td>0</td>\n",
 "      <td>0.0</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>1</th>\n",
 "      <td>OFFICIAL UNITED NATIONS SOUVENIR FOLDERS\\n\\nEa...</td>\n",
 "      <td>2</td>\n",
 "      <td>2</td>\n",
 "      <td>[OFFICIAL, UNITED, NATIONS, SOUVENIR, FOLDERS,...</td>\n",
 "      <td>en</td>\n",
 "      <td>0</td>\n",
 "      <td>1.0</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>2</th>\n",
 "      <td>I am selling Joe Montana SportsTalk Football 9...</td>\n",
 "      <td>2</td>\n",
 "      <td>2</td>\n",
 "      <td>[I, selling, Joe, Montana, SportsTalk, Footbal...</td>\n",
 "      <td>en</td>\n",
 "      <td>0</td>\n",
 "      <td>2.0</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>3</th>\n",
 "      <td>\\n\\nNonsteroid Proventil is a brand of albute...</td>\n",
 "      <td>4</td>\n",
 "      <td>4</td>\n",
 "      <td>[Nonsteroid, Proventil, brand, albuterol, bron...</td>\n",
 "      <td>en</td>\n",
 "      <td>0</td>\n",
 "      <td>3.0</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>4</th>\n",
 "      <td>Two URGENT requests\\n\\n1 I need the latest upd...</td>\n",
 "      <td>6</td>\n",
 "      <td>6</td>\n",
 "      <td>[Two, URGENT, requests, 1, I, need, latest, up...</td>\n",
 "      <td>en</td>\n",
 "      <td>0</td>\n",
 "      <td>4.0</td>\n",
 "    </tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>" ], "text/plain": [ " doc target predicted \\\n", "0 Hello\\n\\n Just one quick question\\n ... 4 4 \n", "1 OFFICIAL UNITED NATIONS SOUVENIR FOLDERS\\n\\nEa... 2 2 \n", "2 I am selling Joe Montana SportsTalk Football 9... 2 2 \n", "3 \\n\\nNonsteroid Proventil is a brand of albute... 4 4 \n", "4 Two URGENT requests\\n\\n1 I need the latest upd... 6 6 \n", "\n", " tokens language batch_id \\\n", "0 [Hello, Just, one, quick, question, My, father... en 0 \n", "1 [OFFICIAL, UNITED, NATIONS, SOUVENIR, FOLDERS,... en 0 \n", "2 [I, selling, Joe, Montana, SportsTalk, Footbal... en 0 \n", "3 [Nonsteroid, Proventil, brand, albuterol, bron... en 0 \n", "4 [Two, URGENT, requests, 1, I, need, latest, up... en 0 \n", "\n", " doc_id \n", "0 0.0 \n", "1 1.0 \n", "2 2.0 \n", "3 3.0 \n", "4 4.0 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "download_url = \"https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Newsgroups/production_en_es.parquet\"\n", "prod_df = pd.read_parquet(download_url)\n", "prod_df.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Language Perturbation - Spanish Documents" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now, we just need to define the ratio of translated documents. We will log 7 batches of data, each representing a day. The first 4 days will be unaltered, so we have a baseline of normal behavior to compare against. 
The last 3 days will have an increasing share of Spanish documents - 33%, 66%, and 100%.\n", "\n", "Let's define a function that returns the data for a given batch with the desired ratio of Spanish documents:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# fraction of Spanish documents for each of the 7 daily batches\n", "language_perturbation_ratio = [0, 0, 0, 0, 0.33, 0.66, 1]\n", "\n", "\n", "def get_docs_by_language_ratio(batch_df, ratio):\n", "    n_docs = len(batch_df[batch_df[\"language\"] == \"en\"])\n", "    n_es_docs = int(n_docs * ratio)\n", "    n_en_docs = n_docs - n_es_docs\n", "    # sample the documents that stay in English\n", "    en_df = batch_df[batch_df[\"language\"] == \"en\"].sample(n_en_docs)\n", "    # filter out docs with doc_id in en_df, so no document appears in both languages\n", "    es_df = batch_df[~batch_df[\"doc_id\"].isin(en_df[\"doc_id\"])]\n", "    es_df = es_df[es_df[\"language\"] == \"es\"].sample(n_es_docs)\n", "    docs = pd.concat([en_df, es_df])\n", "    return docs" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Log and Upload to WhyLabs\n", "\n", "We now have everything we need to log our data and send it to WhyLabs. We will vectorize the data to log the embedding distances, and also get the token list for each document to log the token distribution. We will also predict each document's topic with our trained classifier and log the predictions and targets, so we have access to performance metrics in our dashboard." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "day 0: 2023-03-28 00:00:00+00:00\n", "0% of documents with language perturbation\n", "mean accuracy: 0.8537313432835821\n", "Profiling and logging Embeddings and Tokens...\n", "Profiling and logging classification metrics...\n", "day 1: 2023-03-29 00:00:00+00:00\n", "0% of documents with language perturbation\n", "mean accuracy: 0.8333333333333334\n", "Profiling and logging Embeddings and Tokens...\n", "Profiling and logging classification metrics...\n", "day 2: 2023-03-30 00:00:00+00:00\n", "0% of documents with language perturbation\n", "mean accuracy: 0.8475609756097561\n", "Profiling and logging Embeddings and Tokens...\n", "Profiling and logging classification metrics...\n", "day 3: 2023-03-31 00:00:00+00:00\n", "0% of documents with language perturbation\n", "mean accuracy: 0.811377245508982\n", "Profiling and logging Embeddings and Tokens...\n", "Profiling and logging classification metrics...\n", "day 4: 2023-04-01 00:00:00+00:00\n", "33.0% of documents with language perturbation\n", "mean accuracy: 0.648876404494382\n", "Profiling and logging Embeddings and Tokens...\n", "Profiling and logging classification metrics...\n", "day 5: 2023-04-02 00:00:00+00:00\n", "66.0% of documents with language perturbation\n", "mean accuracy: 0.49244712990936557\n", "Profiling and logging Embeddings and Tokens...\n", "Profiling and logging classification metrics...\n", "day 6: 2023-04-03 00:00:00+00:00\n", "100% of documents with language perturbation\n", "mean accuracy: 0.286144578313253\n", "Profiling and logging Embeddings and Tokens...\n", "Profiling and logging classification metrics...\n" ] } ], "source": [ "from datetime import datetime,timedelta, timezone\n", "import whylogs as why\n", "from whylogs.api.writer.whylabs import WhyLabsWriter\n", "import random\n", "\n", "writer = WhyLabsWriter()\n", "\n", "for day, batch_df in 
prod_df.groupby(\"batch_id\"):\n", " # work on a copy so that assigning the tokens column does not modify a slice of prod_df\n", " batch_df = batch_df.copy()\n", " batch_df['tokens'] = batch_df['tokens'].apply(lambda x: x.tolist())\n", "\n", " # backfill: day 6 maps to today, day 0 to six days ago, each at midnight UTC\n", " dataset_timestamp = datetime.now(timezone.utc) - timedelta(days=6-day)\n", " dataset_timestamp = dataset_timestamp.replace(hour=0, minute=0, second=0, microsecond=0)\n", "\n", " print(f\"day {day}: {dataset_timestamp}\")\n", "\n", " ratio = language_perturbation_ratio[day]\n", " print(f\"{ratio*100}% of documents with language perturbation\")\n", " \n", " mixed_df = get_docs_by_language_ratio(batch_df, ratio)\n", " mixed_df = mixed_df.dropna()\n", "\n", " sample_ratio = random.uniform(0.8, 1) # just to have some variability in the total number of daily docs\n", " mixed_df = mixed_df.sample(frac=sample_ratio).reset_index(drop=True)\n", "\n", "\n", " vectors = vectorizer.transform(mixed_df['doc']).toarray()\n", " predicted = clf.predict(vectors)\n", " print(\"mean accuracy: \", np.mean(predicted == mixed_df['target']))\n", "\n", " print(\"Profiling and logging Embeddings and Tokens...\")\n", " embeddings_profile = why.log(row={\"news_centroids\": vectors},\n", " schema=embedding_schema)\n", " embeddings_profile.set_dataset_timestamp(dataset_timestamp)\n", " writer.write(file=embeddings_profile.view())\n", "\n", " tokens_df = pd.DataFrame({\"document_tokens\":mixed_df[\"tokens\"]})\n", " \n", " tokens_profile = why.log(tokens_df, schema=token_schema)\n", " tokens_profile.set_dataset_timestamp(dataset_timestamp)\n", " writer.write(file=tokens_profile.view()) \n", "\n", "\n", " newsgroups_df = pd.DataFrame({\"output_target\": mixed_df[\"target\"],\n", " \"output_prediction\": predicted})\n", " # to map indices to label names\n", " newsgroups_df[\"output_target\"] = newsgroups_df[\"output_target\"].apply(lambda x: ref_labels[x])\n", " newsgroups_df[\"output_prediction\"] = newsgroups_df[\"output_prediction\"].apply(lambda x: ref_labels[x])\n", "\n", " \n", " print(\"Profiling and logging classification metrics...\")\n", " classification_results 
= why.log_classification_metrics(\n", " newsgroups_df,\n", " target_column=\"output_target\",\n", " prediction_column=\"output_prediction\",\n", " log_full_data=True\n", " )\n", " classification_results.set_dataset_timestamp(dataset_timestamp) \n", " writer.write(file=classification_results.view())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## WhyLabs Dashboard\n", "\n", "You should now have access to a number of different features in your dashboard that represent the different aspects of the pipeline we monitored:\n", "\n", "- `news_centroids`: Relative distance of each document to the centroids of each reference topic cluster, and frequent items for the closest centroid for each document\n", "- `document_tokens`: Distribution of tokens (term length, document length, and frequent items) in each document\n", "- `output_prediction` and `output_target`: The output (predictions and targets) of the classifier, which is also used to compute metrics on the `Performance` tab\n", "\n", "We won't cover everything here for the sake of brevity, but let's take a look at some of the charts.\n", "\n", "With the monitored information, we should be able to correlate the anomalies and reach a conclusion about what happened.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### news_centroids.closest\n", "\n", "The chart below shows the distribution of the closest centroid for each document. For the first 4 days, the distributions are similar to one another. The language perturbations injected in the last 3 days seem to skew the distribution towards the `forsale` topic." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "![alt text](images/closest.png)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### document_tokens.frequent_terms\n", "\n", "Because our tokenization process removes English stopwords but not Spanish ones, most of the frequent terms in the selected period are Spanish stopwords, and those stopwords don't appear in the first 4 days." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "![alt text](images/tokens_items.png)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Performance.F1\n", "\n", "In the `Performance` tab, there is plenty of information telling us that our performance is degrading. For example, the F1 chart below shows that the model gets progressively worse starting on the 5th day." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "![alt text](images/text_f1.png)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "This tutorial covers different aspects of whylogs and WhyLabs. 
Here are some blog posts and examples that relate to this tutorial:\n", "\n", "- [The 20 newsgroups text dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)\n", "- [How to Troubleshoot Embeddings Without Eye-balling t-SNE or UMAP Plots](https://whylabs.ai/blog/posts/how-to-troubleshoot-embeddings-without-eye-balling-tsne-or-umap-plots)\n", "- [Embeddings Distance Logging](https://nbviewer.org/github/whylabs/whylogs/blob/mainline/python/examples/experimental/Embeddings_Distance_Logging.ipynb)\n", "- [NLP Summarization](https://nbviewer.org/github/whylabs/whylogs/blob/mainline/python/examples/experimental/NLP_Summarization.ipynb)\n", "- [Classification Metrics](https://nbviewer.org/github/whylabs/whylogs/blob/mainline/python/examples/integrations/writers/Writing_Classification_Performance_Metrics_to_WhyLabs.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.11.2 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" } } }, "nbformat": 4, "nbformat_minor": 2 }