{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Schema_Configuration)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Schema_Configuration) to leverage the power of whylogs and WhyLabs together!*" ], "metadata": { "id": "5_hazICzT0AX" } }, { "cell_type": "markdown", "source": [ "# Document Summarization Example" ], "metadata": { "id": "phJi2VWRUEio" } }, { "cell_type": "markdown", "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Schema_Configuration.ipynb)" ], "metadata": { "id": "t_TgL10MUrSZ" } }, { "cell_type": "markdown", "source": [ "In this example, we'll look at how we might use whylogs to monitor a document summarization task.\n", "\n", "We'll use [NLTK](https://www.nltk.org) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) to do some of the basic NLP tasks, so let's install the packages we'll need now." ], "metadata": { "id": "rH4hheWfUt7-" } }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "yw9ErM4ZTpZo", "outputId": "46b881df-76f6-452f-b2d0-3ade1e4e3b4c", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Requirement already satisfied: nltk in /usr/local/lib/python3.8/dist-packages (3.7)\n", "Requirement already satisfied: joblib in /usr/local/lib/python3.8/dist-packages (from nltk) (1.2.0)\n", "Requirement already satisfied: click in /usr/local/lib/python3.8/dist-packages (from nltk) (8.1.3)\n", "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.8/dist-packages (from nltk) (2022.6.2)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from nltk) (4.64.1)\n", "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Requirement already satisfied: bs4 in /usr/local/lib/python3.8/dist-packages (0.0.1)\n", "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.8/dist-packages (from bs4) (4.6.3)\n", "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Processing ./whylogs-1.1.28-py3-none-any.whl\n", "Collecting whylogs-sketching>=3.4.1.dev3\n", " Downloading whylogs_sketching-3.4.1.dev3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (548 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m548.0/548.0 KB\u001b[0m \u001b[31m9.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: protobuf>=3.19.4 in /usr/local/lib/python3.8/dist-packages (from whylogs==1.1.28) (3.19.6)\n", "Requirement already satisfied: typing-extensions>=3.10 in /usr/local/lib/python3.8/dist-packages (from whylogs==1.1.28) (4.5.0)\n", "Requirement already satisfied: scikit-learn<2.0.0,>=1.0.2 in /usr/local/lib/python3.8/dist-packages (from whylogs==1.1.28) (1.2.1)\n", "Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from scikit-learn<2.0.0,>=1.0.2->whylogs==1.1.28) (1.2.0)\n", "Requirement already satisfied: 
numpy>=1.17.3 in /usr/local/lib/python3.8/dist-packages (from scikit-learn<2.0.0,>=1.0.2->whylogs==1.1.28) (1.22.4)\n", "Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.8/dist-packages (from scikit-learn<2.0.0,>=1.0.2->whylogs==1.1.28) (1.10.1)\n", "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn<2.0.0,>=1.0.2->whylogs==1.1.28) (3.1.0)\n", "Installing collected packages: whylogs-sketching, whylogs\n", "Successfully installed whylogs-1.1.28 whylogs-sketching-3.4.1.dev3\n" ] } ], "source": [ "%pip install nltk\n", "%pip install bs4\n", "%pip install whylogs[embeddings]" ] }, { "cell_type": "markdown", "source": [ "We'll use the NLTK Reuters corpus as the documents to summarize. As a trivial summarization algorithm, we'll pull out the sentence that contains a document's highest log-entropy weighted term as its summary. Let's start by computing the term-frequency index for the corpus, along with each term's global frequency and entropy. We'll use NLTK's stemming, stopword removal, and tokenization for those calculations, but return the unaltered sentence as the summary." ], "metadata": { "id": "c1D0VHR4YsAC" } }, { "cell_type": "code", "source": [ "from typing import Any, Dict, List, Optional, Set\n", "\n", "import nltk\n", "import numpy as np\n", "\n", "from nltk.corpus import reuters\n", "from bs4 import BeautifulSoup\n", "\n", "nltk.download('reuters')\n", "nltk.download('punkt')\n", "nltk.download('stopwords')\n", "\n", "STEMMER = nltk.stem.PorterStemmer()\n", "\n", "# the NLTK tokenizer produces some junk tokens, so add them to the stopwords\n", "STOPWORDS = set(nltk.corpus.stopwords.words(\"english\") + [\n", " \".\",\n", " \",\",\n", " \"<\",\n", " \">\",\n", " \"'s\",\n", " \"''\",\n", " \"``\",\n", " ]\n", ")\n", "\n", "\n", "def delete_headline(text: str) -> str:\n", " '''\n", " NLTK's sentence tokenizer includes the headline in the first sentence\n", " if we don't manually exclude it.\n", " '''\n", " lines = text.split(\"\\n\")\n", " return \"\\n\".join(lines[1:]) if len(lines) > 1 else text\n", "\n", "\n", "def global_freq(A: np.ndarray) -> np.ndarray:\n", " '''Sum each term row of the term-frequency index across documents to get the term global frequencies'''\n", " gf = np.zeros(A.shape[0])\n", " for i in range(A.shape[0]):\n", " for j in range(A.shape[1]):\n", " gf[i] += A[i, j]\n", " return gf\n", "\n", "\n", "def entropy(A: np.ndarray, gf: np.ndarray) -> np.ndarray:\n", " '''Compute the entropy-based global term weights'''\n", " g = np.zeros(A.shape[0])\n", " logN = np.log(A.shape[1])\n", " for i in range(A.shape[0]):\n", " for j in range(A.shape[1]):\n", " p_ij = A[i, j] / gf[i]\n", " g[i] += p_ij * np.log(p_ij) if p_ij != 0 else 0\n", " g[i] = 1 + g[i] / logN\n", " return g\n", "\n", "\n", "def get_raw_tokens(file) -> List[str]:\n", " '''\n", " The raw NLTK documents contain a few HTML entities, so we'll use BeautifulSoup\n", " to decode them, then apply the NLTK word tokenizer. 
Skip the headline.\n", " '''\n", " raw = BeautifulSoup(delete_headline(reuters.raw(file)), \"html.parser\").get_text()\n", " return [t.casefold() for t in nltk.word_tokenize(raw) if t.casefold() not in STOPWORDS]\n", "\n", "\n", "def get_vocabulary(file) -> Set[str]:\n", " '''\n", " Returns the set of stemmed terms in the specified Reuters article (excluding headline).\n", " '''\n", " # get_raw_tokens() already casefolds the tokens and removes stopwords\n", " tokens = get_raw_tokens(file)\n", " stemmed = [STEMMER.stem(t) for t in tokens]\n", " return set(stemmed)\n", "\n", "\n", "file_ids = reuters.fileids()\n", "train_files = [id for id in file_ids if id.startswith(\"train\")][:500]\n", "\n", "vocab: Set[str] = set()\n", "\n", "for file in train_files:\n", " vocab.update(get_vocabulary(file))\n", "\n", "ndocs = len(train_files)\n", "vocab_size = len(vocab)\n", "print(f\"{ndocs} articles {vocab_size} vocabulary\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ikXbxGhGaq3d", "outputId": "1393b277-ead1-4ae1-88a0-5b8f30bd6924" }, "execution_count": 2, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "[nltk_data] Downloading package reuters to /root/nltk_data...\n", "[nltk_data] Downloading package punkt to /root/nltk_data...\n", "[nltk_data] Unzipping tokenizers/punkt.zip.\n", "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Unzipping corpora/stopwords.zip.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "500 articles 6275 vocabulary\n" ] } ] }, { "cell_type": "markdown", "source": [ "It will also be handy to have mappings back and forth between each term (as a string) and the term's row in the term-frequency matrix. Let's build those up." ], "metadata": { "id": "sMfucM66kMi9" } }, { "cell_type": "code", "source": [ "vocab_map: Dict[str, int] = dict()\n", "rev_map: List[str] = [''] * vocab_size\n", "for i, t in enumerate(vocab):\n", " vocab_map[t] = i\n", " rev_map[i] = t\n", "\n", "index = np.zeros((vocab_size, ndocs))\n", "for col, id in enumerate(train_files):\n", " tokens = get_raw_tokens(id)\n", " stemmed = [STEMMER.stem(t) for t in tokens]\n", " for term in stemmed:\n", " index[ vocab_map[term], col ] += 1\n", "\n", "gf = global_freq(index)\n", "g = entropy(index, gf)" ], "metadata": { "id": "Ueo9hlKtkdV6" }, "execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now we have the inputs we need to compute the term weights, so we can implement our summarization algorithm. But since we want to monitor our summarization process with whylogs, we'll need to do a little whylogs setup before we start summarizing.\n", "\n", "By default, whylogs uses a `TransientLogger` that produces a new profile for every `log()` call. For our example, it's nicer to aggregate all the logging into a single profile. So we'll create a simple `PersistentLogger` to do that."
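, "\n", "\n", "To see why this helps, here is a minimal sketch of the default behavior (hypothetical values, using the top-level `whylogs` API rather than anything defined in this notebook): every `log()` call yields its own profile, which we would otherwise have to merge by hand.\n", "\n", "```python\n", "import whylogs as why\n", "\n", "# each call to log() produces an independent profile...\n", "first = why.log(row={\"tokens\": 3})\n", "second = why.log(row={\"tokens\": 5})\n", "\n", "# ...so aggregating them requires explicitly merging the profile views\n", "merged = first.view().merge(second.view())\n", "```"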
], "metadata": { "id": "WC7S59ADmq89" } }, { "cell_type": "code", "source": [ "from whylogs.api.logger.logger import Logger\n", "from whylogs.core import DatasetProfile, DatasetSchema\n", "from whylogs.core.configs import SummaryConfig\n", "from whylogs.core.dataset_profile import logger as dp_logger # because it doesn't like vectors\n", "from whylogs.core.preprocessing import ListView, PreprocessedColumn\n", "from whylogs.core.resolvers import MetricSpec, ResolverSpec, STANDARD_RESOLVER\n", "from whylogs.core.schema import DeclarativeSchema\n", "from whylogs.core.stubs import pd\n", "from whylogs.core.view.column_profile_view import ColumnProfileView\n", "from whylogs.experimental.extras.nlp_metric import BagOfWordsMetric\n", "\n", "class PersistentLogger(Logger):\n", " def __init__(self, schema: Optional[DatasetSchema] = None):\n", " super().__init__(schema)\n", " self._current_profile = DatasetProfile(schema=self._schema)\n", "\n", " def _get_matching_profiles(\n", " self,\n", " obj: Any = None,\n", " *,\n", " pandas: Optional[pd.DataFrame] = None,\n", " row: Optional[Dict[str, Any]] = None,\n", " schema: Optional[DatasetSchema] = None,\n", " ) -> List[DatasetProfile]:\n", " if schema and schema is not self._schema:\n", " raise ValueError(\n", " \"You cannot pass a DatasetSchema to an instance of PersistentLogger.log(),\"\n", " \"because schema is set once when instantiated, please use TimedRollingLogger(schema) instead.\"\n", " )\n", " return [self._current_profile]\n" ], "metadata": { "id": "TrF-XmKsn35M" }, "execution_count": 4, "outputs": [] }, { "cell_type": "markdown", "source": [ "We also need to attach the `BagOfWordsMetric` to the columns that represent our input articles and output summaries. We log each document as a list of its tokens." ], "metadata": { "id": "JI_c9bxIqy30" } }, { "cell_type": "code", "source": [ "from logging import ERROR\n", "dp_logger.setLevel(ERROR)\n", "\n", "resolvers = STANDARD_RESOLVER + [\n", " ResolverSpec(\n", " column_name = \"article_bow\",\n", " metrics = [MetricSpec(BagOfWordsMetric)]\n", " ),\n", " ResolverSpec(\n", " column_name = \"summary_bow\",\n", " metrics = [MetricSpec(BagOfWordsMetric)]\n", " )\n", "]\n", "schema = DeclarativeSchema(resolvers)\n", "why = PersistentLogger(schema=schema)" ], "metadata": { "id": "qndtdoXBrmtl" }, "execution_count": 5, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now we're finally ready to do some summarization! We'll compute the log entropy weighted term vector for each article as a whole, then use NLTK's sentence tokenizer to split it into sentences. The first sentence that contains the word with the highest weight in the document will be our summary." 
], "metadata": { "id": "SNm95G61tFtf" } }, { "cell_type": "code", "source": [ "profile = None\n", "for file in train_files:\n", " raw = BeautifulSoup(reuters.raw(file), 'html.parser').get_text()\n", " # print(raw.split('\\n')[0]) # print article headline\n", " # print(raw) # print the whole input article\n", " raw = delete_headline(raw)\n", " tokens = [t.casefold() for t in nltk.word_tokenize(raw) if t.casefold() not in STOPWORDS]\n", " stemmed = [STEMMER.stem(t) for t in tokens]\n", " doc_vec = np.zeros(vocab_size)\n", " for term in stemmed:\n", " doc_vec[ vocab_map[term] ] += 1\n", " max_weight = -1\n", " max_term = \"\"\n", " for i in range(vocab_size):\n", " doc_vec[i] = g[i] * np.log(doc_vec[i] + 1.0)\n", " if doc_vec[i] > max_weight:\n", " max_weight = doc_vec[i]\n", " max_term = rev_map[i]\n", " sentences = nltk.sent_tokenize(raw)\n", " max_sentence = \"\"\n", " for sentence in sentences:\n", " tokenized = [t.casefold() for t in nltk.word_tokenize(sentence) if t.casefold() not in STOPWORDS]\n", " stemmed = [STEMMER.stem(t) for t in tokenized]\n", " if max_term in stemmed:\n", " max_sentence = sentence\n", " profile = why.log(obj={\"article_bow\": tokens, \"summary_bow\": tokenized})\n", " break\n", " # max_sentence = max_sentence.replace(\"\\n\", \" \")\n", " # print(f\"{max_weight} {max_term}: {max_sentence}\")" ], "metadata": { "id": "8keC6Y-QtrpD" }, "execution_count": 6, "outputs": [] }, { "cell_type": "markdown", "source": [ "We've logged the full articles as the `article_bow` column and the summaries as the `summary_bow` column. Now let's grab the profile from the logger and take a look at it." ], "metadata": { "id": "hs-9ja982gCP" } }, { "cell_type": "code", "source": [ "def dump_summary(view: ColumnProfileView) -> None:\n", " summary = view.to_summary_dict()\n", " keys = [\n", " \"nlp_bow/doc_length:counts/n\",\n", " \"nlp_bow/doc_length:distribution/mean\",\n", " \"nlp_bow/doc_length:distribution/stddev\",\n", " \"nlp_bow/doc_length:distribution/max\",\n", " \"nlp_bow/doc_length:distribution/min\",\n", " \"nlp_bow/doc_length:distribution/median\",\n", "\n", " \"nlp_bow/term_length:counts/n\",\n", " \"nlp_bow/term_length:distribution/mean\",\n", " \"nlp_bow/term_length:distribution/stddev\",\n", " \"nlp_bow/term_length:distribution/max\",\n", " \"nlp_bow/term_length:distribution/min\",\n", " \"nlp_bow/term_length:distribution/median\",\n", " ]\n", " for key in keys:\n", " print(f\" {key}: {summary[key]}\")\n", " print(f\" frequent terms: {[t.value for t in summary['nlp_bow/frequent_terms:frequent_items/frequent_strings'][:10]]}\")\n", "\n", "\n", "view = profile.view()\n", "columns = view.get_columns()\n", "for col_name, col_view in columns.items():\n", " print(f\"{col_name}:\")\n", " dump_summary(col_view)\n", " print()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "DcqenmOo21FB", "outputId": "d51d9137-8a81-4169-99af-a943289c3b3b" }, "execution_count": 7, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "article_bow:\n", " nlp_bow/doc_length:counts/n: 500\n", " nlp_bow/doc_length:distribution/mean: 88.38000000000004\n", " nlp_bow/doc_length:distribution/stddev: 89.40470907065252\n", " nlp_bow/doc_length:distribution/max: 504.0\n", " nlp_bow/doc_length:distribution/min: 1.0\n", " nlp_bow/doc_length:distribution/median: 59.0\n", " nlp_bow/term_length:counts/n: 44190\n", " nlp_bow/term_length:distribution/mean: 5.906223127404392\n", " nlp_bow/term_length:distribution/stddev: 2.5306350762162584\n", " 
nlp_bow/term_length:distribution/max: 24.0\n", " nlp_bow/term_length:distribution/min: 1.0\n", " nlp_bow/term_length:distribution/median: 6.0\n", " frequent terms: ['said', 'mln', 'dlrs', 'pct', 'vs', 'billion', 'year', 'cts', 'would', 'u.s.']\n", "\n", "summary_bow:\n", " nlp_bow/doc_length:counts/n: 500\n", " nlp_bow/doc_length:distribution/mean: 21.554000000000002\n", " nlp_bow/doc_length:distribution/stddev: 14.143095074153782\n", " nlp_bow/doc_length:distribution/max: 176.0\n", " nlp_bow/doc_length:distribution/min: 1.0\n", " nlp_bow/doc_length:distribution/median: 18.0\n", " nlp_bow/term_length:counts/n: 10777\n", " nlp_bow/term_length:distribution/mean: 5.419690080727475\n", " nlp_bow/term_length:distribution/stddev: 2.5998033619617535\n", " nlp_bow/term_length:distribution/max: 21.0\n", " nlp_bow/term_length:distribution/min: 1.0\n", " nlp_bow/term_length:distribution/median: 5.0\n", " frequent terms: ['vs', 'mln', 'said', 'cts', 'loss', 'net', 'dlrs', 'shr', 'inc', 'billion']\n", "\n" ] } ] }, { "cell_type": "markdown", "source": [ "As expected, we see that the summary documents are shorter than the original articles. We also see some differences and overlap in the most frequent words in the whole articles and the summaries.\n", "\n", "As a final experiment, let's profile the corpus sentence by sentence: each sentence's tokens are logged to an `original_bow` column, while `split_bow` gets either the halves of the sentence split at a randomly chosen comma (when it contains one) or the whole sentence otherwise. That gives us two more bag-of-words columns to compare." ], "metadata": { "id": "KR2FQYlN3Hom" } }, { "cell_type": "code", "source": [ "resolvers = STANDARD_RESOLVER + [\n", " ResolverSpec(\n", " column_name = \"original_bow\",\n", " metrics = [MetricSpec(BagOfWordsMetric)]\n", " ),\n", " ResolverSpec(\n", " column_name = \"split_bow\",\n", " metrics = [MetricSpec(BagOfWordsMetric)]\n", " )\n", "]\n", "schema = DeclarativeSchema(resolvers)\n", "why = PersistentLogger(schema=schema)\n", "\n", "import random\n", "\n", "profile = None\n", "for file in train_files:\n", " raw = BeautifulSoup(reuters.raw(file), 'html.parser').get_text()\n", " raw = delete_headline(raw)\n", " sentences = nltk.sent_tokenize(raw)\n", " for sentence in sentences:\n", " tokens = [t.casefold() for t in nltk.word_tokenize(sentence)]\n", " why.log(obj={\"original_bow\": np.array(tokens)})\n", " phrases = sentence.split(\",\")\n", " if len(phrases) > 1:\n", " index = random.randint(0, len(phrases))\n", " left = [t.casefold() for t in nltk.word_tokenize(\", \".join(phrases[:index]) + \".\")]\n", " right = [t.casefold() for t in nltk.word_tokenize(\", \".join(phrases[index:]))]\n", " why.log(obj={\"split_bow\": left})\n", " profile = why.log(obj={\"split_bow\": right})\n", " else:\n", " profile = why.log(obj={\"split_bow\": tokens})\n", "\n", "view = profile.view()\n", "columns = view.get_columns()\n", "for col_name, col_view in columns.items():\n", " print(f\"{col_name}:\")\n", " dump_summary(col_view)\n", " print()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0KFTk_cvX_kz", "outputId": "dcf01e96-8e37-4f5f-d5e4-94a5e3594b09" }, "execution_count": 8, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "original_bow:\n", " nlp_bow/doc_length:counts/n: 0\n", " nlp_bow/doc_length:distribution/mean: 0.0\n", " nlp_bow/doc_length:distribution/stddev: 0.0\n", " nlp_bow/doc_length:distribution/max: nan\n", " nlp_bow/doc_length:distribution/min: nan\n", " nlp_bow/doc_length:distribution/median: None\n", " nlp_bow/term_length:counts/n: 0\n", " nlp_bow/term_length:distribution/mean: 0.0\n", " nlp_bow/term_length:distribution/stddev: 0.0\n", " nlp_bow/term_length:distribution/max: nan\n", " nlp_bow/term_length:distribution/min: nan\n", " nlp_bow/term_length:distribution/median: None\n", " frequent terms: []\n", 
"\n", "split_bow:\n", " nlp_bow/doc_length:counts/n: 4545\n", " nlp_bow/doc_length:distribution/mean: 16.67216721672163\n", " nlp_bow/doc_length:distribution/stddev: 13.849889480931045\n", " nlp_bow/doc_length:distribution/max: 207.0\n", " nlp_bow/doc_length:distribution/min: 0.0\n", " nlp_bow/doc_length:distribution/median: 15.0\n", " nlp_bow/term_length:counts/n: 75775\n", " nlp_bow/term_length:distribution/mean: 4.389495216100277\n", " nlp_bow/term_length:distribution/stddev: 2.7034094987711406\n", " nlp_bow/term_length:distribution/max: 24.0\n", " nlp_bow/term_length:distribution/min: 1.0\n", " nlp_bow/term_length:distribution/median: 4.0\n", " frequent terms: ['the', '.', ',', 'to', 'of', 'in', 'said', 'and', 'a', 'mln']\n", "\n" ] } ] } ] }