# Chapter 2 — Tokenization: Turning Words into Numbers ## The 5-Year-Old Analogy Computers can only understand **numbers**. They don't know what the letter "A" means — they know "65" (its ASCII code). So we need to convert text into numbers before feeding it to a neural network. The simplest idea: **assign every word a number**: ``` "cat" -> 9246 "sat" -> 6734 "on" -> 389 "the" -> 279 "mat" -> 16789 ``` But English has hundreds of thousands of words. Do we really need a number for "antidisestablishmentarianism"? And what about new words like "skibidi" that didn't exist when we built the vocabulary? ## The Solution: Subword Tokenization (BPE) Instead of whole words, we break text into **frequent subword pieces**: ``` "unbelievably" -> "un" + "believ" + "ably" "running" -> "runn" + "ing" "cats" -> "cat" + "s" "lower" -> "low" + "er" "GPT" -> "G" + "P" + "T" ``` This is **Byte Pair Encoding (BPE)** — the exact algorithm used by GPT-2, GPT-3, GPT-4, and most modern models. ### How BPE Works — Step by Step BPE starts with every character as its own "token," then repeatedly merges the most frequent pair: **Starting text:** `"low lower lowest"` ``` Step 0 (initial — each character is a token): l o w _ l o w e r _ l o w e s t Step 1 (most frequent pair: 'l'+'o' -> 'lo'): lo w _ lo w e r _ lo w e s t Step 2 (most frequent pair: 'lo'+'w' -> 'low'): low _ low e r _ low e s t Step 3 (most frequent pair: 'e'+'s' -> 'es'): low _ low e r _ low es t Step 4 (most frequent pair: 'es'+'t' -> 'est'): low _ low e r _ low est Step 5 (most frequent pair: 'low'+'_' -> 'low_'): low_ low e r _ low_ est ``` After enough merges, we have a vocabulary like: `{l, o, w, e, r, s, t, _, lo, ow, low, er, es, est, low_}` Now new words can be represented using these pieces even if we've never seen them before: ``` "lowest" -> "low" + "est" (both in vocabulary!) "slower" -> "s" + "low" + "er" (never seen before, but works!) ``` ### Why BPE Beats Word-Level Tokenization | Problem | Word-Level | BPE | |---|---|---| | "running" vs "run" | Different tokens — no shared meaning | "runn" + "ing" — the model sees the connection | | New word: "rizz" | Unknown token → model fails | "r" + "i" + "z" + "z" → works with characters | | Vocabulary size | 500K+ (too many rare words) | 50K (balanced, efficient) | | Unicode/emoji handling | Often broken | Character-level fallback never fails | ### What About Special Characters and Emojis? BPE operates on **bytes**, not characters. This means it can tokenize ANYTHING that can be represented as bytes — emojis, Chinese characters, code, LaTeX, even binary data: ``` "Hello 😊" -> ["Hello", " Ġ", "😊"] (Ġ = space prefix in GPT tokenizer) "你好" -> tokenized via UTF-8 bytes "def foo():"-> ["def", "Ġfoo", "()", ":"] ``` ### GPT Tokenizer Conventions | Token | Example | Meaning | |---|---|---| | Normal tokens | `"cat"`, `"the"`, `"ing"` | Regular subword pieces | | Space-prefixed | `"Ġcat"`, `"Ġthe"` | Word starts after a space (Ġ is a special character) | | `<\|endoftext\|>` | EOS token | Marks end of a document — critical for training | | Capital letters | `"The"` vs `"the"` | Different tokens! Case matters | ### The EOS Token — Why It Matters The `<|endoftext|>` (End Of Sequence) token is **critical** and often overlooked: ```python # WITHOUT EOS — two documents get merged: doc1 = "The cat sat." # tokens: [464, 3797, 3332, 13] doc2 = "The dog ran." # tokens: [464, 3290, 3407, 13] # Result: [464, 3797, 3332, 13, 464, 3290, 3407, 13] # Model sees: "...sat. The dog ran." — thinks it's ONE document # Learns: "sat." is often followed by "The" — WRONG! # WITH EOS — documents are separated: tokens = [464, 3797, 3332, 13, EOS, 464, 3290, 3407, 13, EOS] # Model learns: EOS means "we're done here, next token is unrelated" ``` ## Tokenizer Code — Annotated ```python from dataclasses import dataclass import tiktoken @dataclass class TokenizerConfig: """ WHAT: Keeps all tokenizer settings in one place. WHY: Like a recipe card — consistent across the whole project. Change one value and everything updates automatically. """ name: str = "gpt2" # WHAT: use GPT-2's pretrained BPE tokenizer # WHY: same BPE as GPT-3/4 — 50K merges, # battle-tested on billions of documents, # and already trained (no weeks of work) vocab_size: int = 50257 # WHAT: total number of unique tokens # WHY: 50,257 is the exact GPT-2 vocabulary size # (50,000 merges + 256 byte tokens + 1 EOS) # This is the "goldilocks" number — # big enough for rare subwords, # small enough for fast matrix operations class SimpleTokenizer: """ WHAT: Wraps tiktoken to give us a friendly, consistent interface. WHY: tiktoken's raw API is low-level (you need to specify allowed_special every call). This wrapper makes encode/decode trivial — just call .encode("hello") and get tokens back. It also handles the EOS token consistently so we never accidentally forget to add it during training data prep. """ def __init__(self, config: TokenizerConfig = None): """ WHAT: Initialize the tokenizer with GPT-2's BPE vocabulary. WHY: We use a pretrained tokenizer because: 1. Training a tokenizer from scratch takes weeks of CPU time 2. GPT-2's tokenizer is open-source, fast, and well-tested 3. Using the same tokenizer as production models means our code works identically to how GPT-3 tokenizes """ self.config = config or TokenizerConfig() # WHAT: Load the GPT-2 encoding from tiktoken # WHY: tiktoken stores pretrained BPE merge tables. # get_encoding("gpt2") loads the exact 50K merges # that GPT-2 was trained with. self.enc = tiktoken.get_encoding(self.config.name) # WHAT: Define and encode the End-of-Sequence token # WHY: <|endoftext|> is the special token that marks boundaries # between documents. During training, we insert it between # every document so the model learns where one text ends # and another begins. self.eos_token = "<|endoftext|>" # The string representation self.eos_token_id = self.enc.encode( # Convert to its token ID self.eos_token, allowed_special={self.eos_token} # WHY: tiktoken blocks special tokens # by default for safety. We must # explicitly allow EOS encoding. )[0] # [0] because encode() returns a list — we want the single ID def encode(self, text: str) -> list[int]: """ WHAT: Turn text into a list of integer token IDs. WHY: Neural networks only eat numbers. Raw strings like "Hello world" mean nothing to matrix multiplication. Example: "Hello world" -> [15496, 995] Under the hood: tiktoken splits the text into subword pieces using the pretrained BPE merge table, then looks up each piece's ID in the vocabulary. """ # WHAT: Use tiktoken's fast C/Rust-based encoder # WHY: tiktoken is written in Rust, not Python. # It can tokenize hundreds of MB of text per second. # A pure Python BPE tokenizer would be 100x slower. return self.enc.encode(text, allowed_special={self.eos_token}) def decode(self, ids: list[int]) -> str: """ WHAT: Turn token IDs back into human-readable text. WHY: After the model generates a sequence of token IDs during inference, we need to convert them back to text so humans can read the output. Example: [15496, 995] -> "Hello world" """ return self.enc.decode(ids) @property def vocab_size(self) -> int: """ WHAT: How many unique tokens exist in the vocabulary. WHY: This number determines the size of our model's output layer — the final Linear layer must have vocab_size outputs (one score for each possible next token). 50,257 means the model chooses from 50,257 possibilities every time it predicts the next word. """ return self.config.vocab_size # ===== WHAT: Quick self-test ===== # WHY: Always test each component in isolation before combining. # "Does the tokenizer work?" is a 5-second check that saves # hours of debugging a misbehaving training loop. if __name__ == "__main__": tokenizer = SimpleTokenizer() # Test 1: Basic text test_text = "The cat sat on the mat." encoded = tokenizer.encode(test_text) decoded = tokenizer.decode(encoded) print(f"Test 1 — Basic:") print(f" Original: '{test_text}'") print(f" Encoded: {encoded}") print(f" Decoded: '{decoded}'") print(f" Match: {test_text == decoded}") # Test 2: EOS token eos = tokenizer.encode(tokenizer.eos_token) print(f"\nTest 2 — EOS token:") print(f" String: '{tokenizer.eos_token}'") print(f" Token ID: {tokenizer.eos_token_id}") print(f" Encode result: {eos}") # Test 3: Rare/unseen word rare = tokenizer.encode("antidisestablishmentarianism") decoded_rare = tokenizer.decode(rare) print(f"\nTest 3 — Rare word:") print(f" Encoded: {rare}") print(f" Pieces: {[tokenizer.decode([t]) for t in rare]}") print(f" Decoded: '{decoded_rare}'") # Test 4: Emoji/Unicode emoji = tokenizer.encode("Hello 😊 world") print(f"\nTest 4 — Emoji:") print(f" Encoded: {emoji}") print(f" Decoded: '{tokenizer.decode(emoji)}'") print(f"\n Vocab size: {tokenizer.vocab_size:,}") ``` **Expected output:** ``` Test 1 — Basic: Original: 'The cat sat on the mat.' Encoded: [464, 3797, 3332, 319, 262, 2603, 13] Decoded: 'The cat sat on the mat.' Match: True Test 2 — EOS token: String: '<|endoftext|>' Token ID: 50256 Encode result: [50256] Test 3 — Rare word: Encoded: [378, 420, 1634, 2013, 82, 622, 441, 979, 389] Pieces: ['ant', 'idis', 'establish', 'ment', 'ar', 'ian', 'ism'] Decoded: 'antidisestablishmentarianism' Test 4 — Emoji: Encoded: [15496, 52430, 23530, 248, 995] Decoded: 'Hello 😊 world' Vocab size: 50,257 ``` --- **Previous:** [Chapter 1 — Setup](01_setup.md) **Next:** [Chapter 3 — Embeddings](03_embeddings.md)