{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Transformers installation\n", "! pip install transformers datasets\n", "# To install from source instead of the last release, comment the command above and uncomment the following one.\n", "# ! pip install git+https://github.com/huggingface/transformers.git" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Perplexity of fixed-length models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note\n", "that the metric applies specifically to classical language models (sometimes called autoregressive or causal language\n", "models) and is not well defined for masked language models like BERT (see [summary of the models](https://huggingface.co/docs/transformers/main/en/model_summary)).\n", "\n", "Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized\n", "sequence $X = (x_0, x_1, \\dots, x_t)$, then the perplexity of $X$ is,\n", "\n", "$$\\text{PPL}(X) = \\exp \\left\\{ {-\\frac{1}{t}\\sum_i^t \\log p_\\theta (x_i|x_{\n", "\n", "When working with approximate models, however, we typically have a constraint on the number of tokens the model can\n", "process. The largest version of [GPT-2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we\n", "cannot calculate $p_\\theta(x_t|x_{\n", "\n", "This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor\n", "approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will\n", "have less context at most of the prediction steps.\n", "\n", "Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly\n", "sliding the context window so that the model has more context when making each prediction.\n", "\n", "\"Sliding\n", "\n", "This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more\n", "favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good\n", "practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by\n", "1 token a time. This allows computation to proceed much faster while still giving the model a large context to make\n", "predictions at each step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Calculating perplexity with GPT-2 in 🤗 Transformers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's demonstrate this process with GPT-2." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import GPT2LMHeadModel, GPT2TokenizerFast\n", "\n", "device = \"cuda\"\n", "model_id = \"gpt2-large\"\n", "model = GPT2LMHeadModel.from_pretrained(model_id).to(device)\n", "tokenizer = GPT2TokenizerFast.from_pretrained(model_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since\n", "this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire\n", "dataset in memory." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "\n", "test = load_dataset(\"wikitext\", \"wikitext-2-raw-v1\", split=\"test\")\n", "encodings = tokenizer(\"\\n\\n\".join(test[\"text\"]), return_tensors=\"pt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With 🤗 Transformers, we can simply pass the `input_ids` as the `labels` to our model, and the average negative\n", "log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in\n", "the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating\n", "as context to be included in our loss, so we can set these targets to `-100` so that they are ignored. The following\n", "is an example of how we could do this with a stride of `512`. This means that the model will have at least 512 tokens\n", "for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens\n", "available to condition on)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from tqdm import tqdm\n", "\n", "max_length = model.config.n_positions\n", "stride = 512\n", "\n", "nlls = []\n", "for i in tqdm(range(0, encodings.input_ids.size(1), stride)):\n", " begin_loc = max(i + stride - max_length, 0)\n", " end_loc = min(i + stride, encodings.input_ids.size(1))\n", " trg_len = end_loc - i # may be different from stride on last loop\n", " input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)\n", " target_ids = input_ids.clone()\n", " target_ids[:, :-trg_len] = -100\n", "\n", " with torch.no_grad():\n", " outputs = model(input_ids, labels=target_ids)\n", " neg_log_likelihood = outputs[0] * trg_len\n", "\n", " nlls.append(neg_log_likelihood)\n", "\n", "ppl = torch.exp(torch.stack(nlls).sum() / end_loc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window\n", "strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,\n", "and the better the reported perplexity will typically be.\n", "\n", "When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.64`, which is about the same\n", "as the `19.93` reported in the GPT-2 paper. By using `stride = 512` and thereby employing our striding window\n", "strategy, this jumps down to `16.53`. This is not only a more favorable score, but is calculated in a way that is\n", "closer to the true autoregressive decomposition of a sequence likelihood." ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 4 }