# Perplexity: The One Number That Measures Your Model

## The short answer

Perplexity is a single number that tells you how well a language model predicts text. Lower is better. A perplexity of 10 means the model is as confused as if it had to choose between 10 equally likely words at every step. A perplexity of 1 means the model knows exactly what word comes next every time. Real language models on real text usually have perplexity between 10 and 100.

Perplexity is not abstract. It is the exponentiated cross entropy loss. If your training loss is 3.0 (measured in nats, which is what the natural log gives you), your perplexity is e to the power of 3, which is about 20. The model is as confused as if picking randomly among 20 options.

## Where the number comes from

Every time the model predicts the next word, it assigns a probability to every token in the vocabulary. The correct token gets some probability P. The lower P is, the more uncertain the model was about that prediction. Cross entropy loss measures that uncertainty as the negative log of P. Perplexity is e to the power of that loss.

```
For a single prediction:
  Model says P("mat") = 0.25
  Loss       = -ln(0.25) = 1.386
  Perplexity = e^1.386   = 4.0

Interpretation: The model was as uncertain as if it had
to pick randomly among 4 equally likely options.
```

The magic of perplexity is that it translates an abstract loss number into something you can visualize. A loss of 1.386 means nothing to most people. A perplexity of 4 means the model is choosing among 4 options. That is tangible. You can imagine picking randomly from 4 words.

## Perplexity versus loss

During training you watch the loss go down. For a model with a 50257-token vocabulary, the loss starts around 10.8, because ln(50257) ≈ 10.82. On its own that number is hard to interpret. But you can convert it.

```
loss = 10.82:  perplexity = e^10.82 ≈ 50,000  (random. Model knows nothing)
loss = 7.0:    perplexity = e^7.0   ≈ 1,100   (learning word frequencies)
loss = 5.0:    perplexity = e^5.0   ≈ 150     (learning basic grammar)
loss = 3.0:    perplexity = e^3.0   ≈ 20      (decent language model)
loss = 2.0:    perplexity = e^2.0   ≈ 7.4     (good language model)
loss = 1.5:    perplexity = e^1.5   ≈ 4.5     (very good)
loss = 1.0:    perplexity = e^1.0   ≈ 2.7     (excellent)
```

Perplexity gives you a mental model for what the loss actually means. When your loss goes from 10.8 to 7.0 you have not just improved by 3.8 units. You have gone from being as confused as 50000 options to being as confused as 1100 options. That is a dramatic improvement.

## How to compute it

In code, perplexity is one line.

```python
import math

loss = 3.0  # Your model's cross entropy loss (in nats)
perplexity = math.exp(loss)

print(f"Loss: {loss:.4f}")
print(f"Perplexity: {perplexity:.2f}")
print(f"The model is as uncertain as picking among {perplexity:.0f} options.")
```

For a whole dataset, average the loss over all tokens first, then exponentiate. Averaging per-batch perplexities instead gives a different, incorrect number.

```python
import math

import torch
import torch.nn.functional as F

# Assumes `model` and `dataloader` are already defined, and that
# model(input_ids) returns logits of shape (batch, seq_len, vocab_size).
model.eval()  # turn off dropout for evaluation
total_loss = 0.0
total_tokens = 0

for input_ids, target_ids in dataloader:
    with torch.no_grad():
        logits = model(input_ids)
        # Sum (not mean) per batch so every token gets equal weight overall
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_ids.reshape(-1),
            reduction='sum'
        )
    total_loss += loss.item()
    total_tokens += target_ids.numel()

average_loss = total_loss / total_tokens
perplexity = math.exp(average_loss)
print(f"Validation perplexity: {perplexity:.2f}")
```

Always compute perplexity on data the model has not seen during training. Training perplexity can be misleadingly low because the model has memorized parts of the training data. Validation perplexity measures how well the model generalizes.

## What different perplexity values mean

### Perplexity around 50000

Your model is random. It assigns roughly equal probability to every token in the vocabulary. It has learned nothing. This is normal at step zero of training. If it stays here after thousands of steps, something is broken. Check your loss function and optimizer.

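
The claim that a random model sits at the vocabulary size can be checked directly. The sketch below is a minimal illustration, assuming a perfectly uniform model and the 50257-token vocabulary used earlier in this article; `uniform_cross_entropy` is a hypothetical helper name, not part of any library.

```python
import math


def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross entropy in nats when every token gets probability 1 / vocab_size."""
    p_correct = 1.0 / vocab_size  # the correct token is no more likely than any other
    return -math.log(p_correct)


vocab_size = 50257  # assumption: the vocabulary size used throughout this article
loss = uniform_cross_entropy(vocab_size)
perplexity = math.exp(loss)

print(f"Expected loss at step 0: {loss:.2f}")
print(f"Expected perplexity at step 0: {perplexity:.0f}")
```

A freshly initialized network is only approximately uniform, so its first logged loss will usually land near this value rather than exactly on it.
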
### Perplexity around 1000

The model has learned that some words are more common than others. It knows that *the* appears often and *xylophone* appears rarely. It uses these frequencies in its predictions. But it does not yet understand word order or grammar or meaning. The output is gibberish, but the gibberish contains common words in roughly the right proportions.

### Perplexity around 100

The model has learned basic grammar. It knows that articles precede nouns. It knows that verbs agree with subjects in number. It knows that periods end sentences. The output has recognizable sentence structure even if the content is nonsensical. This is where most small models plateau after limited training.

### Perplexity around 20

The model writes coherent text. Sentences have subjects and verbs and objects in the right order. The content is sometimes factual and sometimes invented. This is roughly the level of GPT-1 from 2018. A model with 17 million parameters trained on 100 million tokens might reach this level.

### Perplexity around 10

The model writes good text. The content is mostly factual. Few obvious errors. This is roughly the level of GPT-2 from 2019. A model with 150 million parameters trained on billions of tokens can reach this.

### Perplexity around 5

The model writes excellent text. Rarely makes factual errors. Handles complex reasoning. This is roughly the level of GPT-3 from 2020 and modern small models like LLaMA 7B. Training these models costs millions of dollars.

### Perplexity below 3

The model approaches human performance on language modeling. It predicts what a human would write with high accuracy. Models at this level are measured on harder tasks like question answering and code generation, because perplexity alone stops telling them apart. The difference between perplexity 2.5 and 2.3 is hard to feel but expensive to achieve.

## Why perplexity is not everything

Perplexity measures how well the model predicts the next word.
It does not measure whether the model is helpful or truthful or safe or creative. A model can have great perplexity and still generate harmful content. It can have great perplexity and still hallucinate facts. It can have great perplexity and still be boring.

Perplexity also depends on the dataset. A model trained on children's books will have low perplexity on children's books and high perplexity on legal documents. Perplexity is always relative to the test data. Comparing perplexity between models is only meaningful when the models are evaluated on the same dataset with the same tokenizer.

Different tokenizers produce different perplexity values on the same text, even for models that predict it equally well. Perplexity is measured per token, so it depends on how much text each token covers. A tokenizer with a larger vocabulary usually gives higher per-token perplexity, because each token encodes more information and the same total uncertainty is spread over fewer predictions. This is why you cannot compare perplexity between models that use different tokenizers.

## The relationship to bits per character

Perplexity can be converted to bits per character. Bits per character measures how many bits of information the model needs on average to encode each character of text. Lower is better. Because it is normalized by characters rather than tokens, it does not depend on the tokenizer. The relationship is:

```
bits_per_character = ln(perplexity) / (ln(2) × characters_per_token)
```

For GPT-2, each token covers about 4 characters on average.

```
perplexity = 20:  bits_per_char = ln(20) / (0.693 × 4) ≈ 1.08 bits per character
perplexity = 10:  bits_per_char = ln(10) / (0.693 × 4) ≈ 0.83 bits per character
```

This tells you that the model needs about one bit of information per character to encode English text. Shannon's classic estimates put the entropy of English at roughly 0.6 to 1.3 bits per character. Models are within that range. Further improvements in perplexity will likely require better understanding of meaning, not just better statistics.

## Perplexity during training

A good training run should show perplexity decreasing fairly smoothly. The starting perplexity should be close to the vocabulary size.
The final perplexity depends on model size, data quality, and training duration.

```
Step      Loss    Perplexity   Stage
0         10.82   50,257       Model knows nothing
100       9.23    10,240       Learning frequencies
500       7.45    1,720        Learning word patterns
1,000     6.12    455          Emerging grammar
5,000     4.23    69           Coherent phrases
10,000    3.45    31           Decent sentences
50,000    2.89    18           Good model
```

If perplexity plateaus early, for example well before step 10000 in a run like the one above, something is limiting the model. The learning rate might be too low or too high. The model capacity might be exhausted. The data might not contain enough patterns to learn from. Try adjusting the learning rate or increasing the model size or the dataset size.

If perplexity decreases on training data but increases on validation data, the model is overfitting. It is memorizing the training set instead of learning general patterns. The training and validation curves diverge. Add more dropout or weight decay, or use early stopping.

## Perplexity of famous models

All numbers are approximate and depend on the evaluation dataset, the tokenizer, and the evaluation setup. Treat them as rough orders of magnitude.

```
Model                 Params   Perplexity (WikiText-103)
Random baseline       -        50,257
GPT-1 (2018)          117M     ~35
GPT-2 Small (2019)    124M     ~19
GPT-2 Medium          350M     ~15
GPT-2 Large           774M     ~12
GPT-3 Small (2020)    125M     ~18
GPT-3 XL              1.3B     ~10
GPT-3                 6.7B     ~8
GPT-3                 175B     ~5
LLaMA 7B (2023)       7B       ~7
LLaMA 13B             13B      ~6
LLaMA 70B             70B      ~4
```

Notice something. GPT-2 Small at 124 million parameters achieves perplexity 19. GPT-3 Small at 125 million parameters achieves perplexity 18. Nearly the same architecture. Nearly the same size. Different training. GPT-3 was trained on more data for longer. The extra compute improved perplexity even without increasing model size. This is why data quality and training duration matter alongside model size and architecture.

## What you need to remember

Perplexity is the exponentiated cross entropy loss. A perplexity of N means the model is as uncertain as if it had to pick randomly among N options. Lower is better. Random models start at around the vocabulary size. Good models reach single digits.

Perplexity translates the abstract loss number into something visual. When your loss drops from 10 to 5, your model has gone from being confused among about 22000 options to being confused among about 150 options. That is the difference between a model that knows nothing and a model that knows something.

Perplexity does not measure helpfulness or truthfulness or safety. It only measures how well the model predicts the next word. For evaluating whether your model is useful, you need other metrics. But for tracking whether your model is learning, perplexity is the single most important number.
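
The arithmetic in this summary is easy to reproduce. The following is a minimal sketch, not a definitive utility: `perplexity_from_loss` and `bits_per_character` are hypothetical helper names, and the default of 4 characters per token is the GPT-2 average quoted in this article, assumed here for illustration.

```python
import math


def perplexity_from_loss(loss_nats: float) -> float:
    """Perplexity is e raised to the average cross entropy loss (in nats)."""
    return math.exp(loss_nats)


def bits_per_character(perplexity: float, chars_per_token: float = 4.0) -> float:
    """Convert per-token perplexity to bits per character.

    chars_per_token is an assumption; measure it on your own tokenizer and data.
    """
    return math.log(perplexity) / (math.log(2) * chars_per_token)


# Loss values taken from the examples in this article
for loss in (10.0, 5.0, 3.0):
    ppl = perplexity_from_loss(loss)
    bpc = bits_per_character(ppl)
    print(f"loss={loss:4.1f}  perplexity={ppl:10.1f}  bits/char={bpc:.2f}")
```

Passing your own measured characters-per-token value makes the bits-per-character figure comparable across tokenizers, which raw perplexity is not.
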