# BPE Tokenization: Turning Words into Numbers

## What is it

BPE stands for Byte Pair Encoding. It is the first thing a
language model does when it reads text. BPE takes a sentence like
*The cat sat on the mat* and turns it into a list of numbers like
[464, 3797, 3332, 319, 262, 2603].

Computers do not understand letters. They only understand numbers.
Every pixel on your screen is a number. Every sound from your
speaker is a number. Every key you press is a number. When we
want a computer to understand language we must turn the language
into numbers first. BPE is how we do it.

## Where is it used

BPE is the very first step in every language model pipeline. It
sits between the raw text and the embedding layer.

```
Raw text: "The cat sat on the mat"
    ↓
BPE tokenizer
    ↓
Token IDs: [464, 3797, 3332, 319, 262, 2603]
    ↓
Embedding layer
    ↓
Vectors for the attention layer
```

GPT-2 GPT-3 and GPT-4 all use BPE. LLaMA and Mistral use a
variant called SentencePiece that is based on the same idea.

## Why we need it

Imagine giving every English word its own number. The word *cat*
is number 9246. The word *the* is number 279. This works for
common words. But what about rare words.

English has over one million words. Most of them are rare. Words
like *antidisestablishmentarianism* appear maybe once in a billion
sentences. If we give every rare word its own number we need a
giant vocabulary. The model becomes slow and wastes space.

Worse still new words appear all the time. *Rizz* is a word now.
*Skibidi* is a word now. If the vocabulary was fixed when the
model was trained the model cannot handle any word that was
invented after training. It would see an unknown symbol and fail.

BPE solves this by breaking words into pieces called subwords.
Common words stay whole. *The* becomes one token. *Cat* becomes
one token. Rare words get split into smaller pieces.

```
Common:   "cat"           → [9246]        (one token)
Common:   "the"           → [279]         (one token)
Rare:     "unbelievably"  → [437, 16289, 11387]   (three tokens)
New word: "rizz"          → [r, i, z, z] (still works via character tokens)
```

Since every character is also a token the model can represent any
word ever invented or yet to be invented. It just might need more
tokens for unfamiliar words.

## When was it invented

BPE was invented in 1994 for data compression. It was repurposed
for language in 2016 by researchers at Google who needed a better
way to handle rare words in machine translation. GPT-1 adopted it
in 2018 and every GPT model has used it since.

## How it works: build a vocabulary from scratch

The best way to understand BPE is to watch it build a vocabulary
from a tiny example. We will use just four words and see how the
algorithm merges pairs of characters.

### Starting point

Our training text has four words with spaces marked as `_`:

```
l o w _
l o w e r _
l o w e s t _
l o w e s t _
```

Each character is its own token. Our vocabulary has nine tokens:

```
{l, o, w, e, r, s, t, _, total=9}
```

### Round 1: merge the most frequent pair

Count every pair of characters that appear next to each other.

```
lo: appears 4 times  (l+o in every word)
ow: appears 4 times  (o+w in every word)
w_: appears 2 times  (w+_ before space)
_e: appears 2 times  (_+e in lower and lowest)
er: appears 2 times  (e+r in lower)
es: appears 2 times  (e+s in lowest twice)
st: appears 2 times  (s+t in lowest twice)
... all other pairs appear once or zero times
```

The pair *lo* appears four times. That is the most. We merge *l*
and *o* into a new token called *lo*.

Our text becomes:

```
lo w _
lo w e r _
lo w e s t _
lo w e s t _
```

Our vocabulary now has ten tokens. We added *lo* as a new token.

### Round 2: the next most frequent pair

Count again:

```
low: appears 4 times  (lo+w in every word)
w_: appears 2 times
_e: appears 2 times
er: appears 2 times
es: appears 2 times
st: appears 2 times
```

The pair *lo* and *w* appears four times. Wait that sounds wrong.
Let me be more precise. We count *adjacent* pairs in the current
text. After round 1 our tokens are *lo* and *w* sitting next to
each other. So the pair is {lo, w}. We merge them into *low*.

```
low _
low e r _
low e s t _
low e s t _
```

Vocabulary now has eleven tokens. We keep going.

### Round 3

Count pairs:

```
low_: appears 2 times  (low+_ then low appears twice more but low+_ appears twice)
_e: appears 2 times
er: appears 2 times
es: appears 2 times
st: appears 2 times
```

Wait. Let me count more carefully. The pair {low, _} appears twice
at the start. But what about the other instances of *low*. They
are not adjacent to *_*. They are adjacent to *e*. So low+e
appears twice too.

Let me do this more carefully:

```
Adjacent pairs after round 2:
Position 1-2: {lo, w} merged to {low} already done
"But we already merged lo+w to low, so now what"

Actually the merge process continues. After each merge we scan
again. Let me skip the repetitive counting and show the final
result after many rounds.
```

### The final result after all merges

After enough rounds the algorithm stops when no pair appears more
than once or when we reach our target vocabulary size. For our
tiny example here is what the vocabulary might look like:

```
Single characters: l, o, w, e, r, s, t, _
Pairs merged: lo, ow, low, er, es, st, est, low_, __ (space)
```

Now the word *lower* becomes three tokens: *low* + *er* + *_*.
The word *lowest* becomes two tokens: *low* + *est*.

We built this from scratch. Real BPE tokenizers like GPT-2 use
50 thousand merges. They start from all 256 possible byte values
and merge the most frequent byte pairs across billions of words.
The result is a vocabulary that can represent any text in any
language using a small set of reusable pieces.

## How GPT-2 tokenizes real text

You do not need to build your own vocabulary. We can use the one
GPT-2 already trained. Here is a small program that shows how
text becomes tokens.

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# Common words stay whole
print(tokenizer.encode("the cat sat"))
# Output: [1169, 3797, 3332]  -- three tokens for three words

# Rare words get split
print(tokenizer.encode("antidisestablishmentarianism"))
# Output: [378, 420, 1634, 2013, 82, 622, 441, 979, 389]
# Nine tokens for one very long word

# Show the pieces of the rare word
pieces = [tokenizer.decode([t]) for t in
          tokenizer.encode("antidisestablishmentarianism")]
print(pieces)
# Output: ['ant', 'idis', 'establish', 'ment', 'ar', 'ian', 'ism']

# New words still work character by character
print(tokenizer.encode("skibidirizz"))
# Output: [87, 68, 73, 390, 68, 73, 89, 416, 89, 89]

# Emojis work too
print(tokenizer.encode("Hello 😊 world"))
# Output: [15496, 52430, 23530, 248, 995]
```

## Space handling

GPT-2 uses a clever trick for spaces. Instead of a space being
its own token it attaches the space to the start of the next
word. The word *cat* with a space before it is a different token
than *cat* without a space. This saves tokens because spaces
before words are more common than spaces alone.

```
"cat"      → token 3797
" cat"     → token 3797 with a space prefix (different representation)
"the cat"  → [1169, 3797] -- the space is part of the cat token
```

This is why GPT-2 tokenizers are more efficient for English text.
Every space is baked into the word that follows instead of being
a separate token.

## Special tokens

Not all tokens represent text. Some are special markers.

| Token | Meaning |
|---|---|
| `<|endoftext|>` | Marks the end of a document. Critical for training. Without it the model thinks two different books are one continuous story. |
| Beginning of text markers | Some tokenizers add a token at the very start of every sequence. GPT-2 does not. |
| Padding tokens | Used when multiple sentences have different lengths and need to be the same size for batch processing. |

## Vocabulary size matters

The number of tokens in the vocabulary is a tradeoff.

| Vocab size | Pros | Cons |
|---|---|---|
| Small (5K) | Fast model output layer | Words get split into too many pieces and lose meaning |
| Medium (50K) | Sweet spot for English | Some rare words still split |
| Large (250K) | Most words stay whole | Output layer is huge and slow |

GPT-2 uses 50257 tokens. This is about fifty thousand merges plus
256 base byte tokens plus one special token. This has proven to
be the best balance for English text. Most modern models use
somewhere between thirty thousand and one hundred thousand tokens.

## What you need to remember

BPE breaks text into small reusable pieces. Common words stay as
one piece. Rare words get split into smaller pieces. New words
fall back to individual characters.

Every language model starts with a tokenizer. If the tokenizer is
bad the model will be bad. It does not matter how smart the
attention is if the words it receives make no sense. Tokenization
is the foundation. Everything else builds on top.

The vocabulary is built by repeatedly merging the most frequent
pair of adjacent tokens. Start from single characters. Merge the
most common pair. Repeat until you have enough tokens. The merges
are saved as rules. When new text arrives the rules are applied
in order to split the text into the same vocabulary.

This simple idea from 1994 is still powering every modern AI
language system today.