# Embeddings: Giving Numbers Meaning

## What is it

An embedding is a list of numbers that captures the meaning of
a word.

After tokenization every word is just a number. *Cat* is 9246.
*Dog* is 4821. These numbers are just labels. The number 9246
means nothing by itself. The model cannot learn from a number
like 9246 because 9246 is not closer to 4821 than it is to 279.
They are just arbitrary IDs.

An embedding turns that number into a vector. A vector is just
a list of decimal numbers. For GPT-2 each word becomes a list of
768 numbers. These numbers are not arbitrary. Words with similar
meanings get similar lists. Words with different meanings get
different lists.

```
Token ID 9246 ("cat") → [0.023, -0.451, 0.789, ..., -0.102]  (768 numbers)
Token ID 4821 ("dog")  → [0.019, -0.443, 0.795, ..., -0.098]  (very similar!)
Token ID 279  ("the")  → [0.891, 0.112, -0.334, ..., 0.567]   (very different!)
```

The key idea is that *cat* and *dog* are near each other in this
number space because they are both animals. *The* is far away
because it is a function word with a completely different role.

## Where is it used

The embedding layer sits right after the tokenizer and right
before the attention layers. It is the bridge between the
tokenizer's integer output and the transformer's floating point
input.

```
Raw text: "The cat"
    ↓
Tokenizer: [1169, 3797]
    ↓
Embedding layer: two vectors of 768 numbers each
    ↓
Transformer blocks
```

Every modern language model has an embedding layer. It is the
very first learned component in the entire pipeline.

## Why we need it

Without embeddings the model would be trying to do math on token
IDs. Imagine adding two words together. The token for *king* is
9246. The token for *queen* is 9247. If we added them we would get
18493. That number means nothing. It does not correspond to a
meaningful word. The model cannot learn from token IDs.

With embeddings the model works with continuous vectors. The
vector for *king* is something like [0.3, -0.5, 0.8, ...]. The
vector for *queen* is [0.2, -0.6, 0.7, ...]. These are close but
not identical. The model can compute *king* minus *man* plus
*woman* and get something very close to the vector for *queen*.
This is called the embedding arithmetic property.

```
embedding(king)  ≈ [0.30, -0.50, 0.80]
embedding(man)   ≈ [0.25, -0.45, -0.30]
embedding(woman) ≈ [0.22, -0.55, 0.70]
embedding(queen) ≈ [0.27, -0.60, 0.75]

king - man + woman = [0.27, -0.60, 0.75] ≈ queen!
```

This was not programmed by a human. The model discovered that
changing the gender of a word is like moving in a straight line
through the embedding space. It learned this entirely from
reading millions of sentences where *king* and *queen* appeared
in similar contexts but with different pronouns.

## When was it invented

The idea of word embeddings is old. A technique called Word2Vec
was published by Google in 2013. It was the first to show that
word vectors could capture meaning relationships. The embedding
layer in transformers is a direct descendant of Word2Vec. The
difference is that Word2Vec embeddings were precomputed and
frozen. Transformer embeddings are learned from scratch during
training. They adapt to the specific task the model is learning.

## How it works: a giant lookup table

Think of the embedding layer as a table with 50257 rows. Each row
has 768 columns. Row zero is the vector for token zero. Row one
is the vector for token one. Row 3797 is the vector for the word
*cat*. The forward pass of the embedding layer is just looking up
rows in this table.

```python
# Given token IDs: [1169, 3797]
# Look up row 1169 → vector for "The"  (768 numbers)
# Look up row 3797 → vector for "cat"  (768 numbers)
# Return both vectors
```

That is the entire forward pass of the embedding layer. No
multiplication. No activation function. Just a table lookup.

### How the table is built

The table starts completely random. Every row is filled with
numbers drawn from a normal distribution with mean 0 and standard
deviation 0.02. At this point *cat* and *dog* are as close to
each other as *cat* and *democracy*. Everything is random noise.

Then training begins. The model reads a sentence like *The cat
sat on the mat*. It predicts that *mat* should come next. If it
predicts wrong the loss is high. Backpropagation sends a tiny
signal back through the entire model including the embedding
table. That signal says:

"The vector for *cat* should be nudged slightly toward the
direction that helps predict *mat* next time."

After millions of training steps the table transforms. Words that
appear in similar contexts get pushed toward similar positions.
The vector for *cat* moves close to *dog* and *pet* and *feline*.
The vector for *car* moves close to *vehicle* and *drive* and
*road*. The space organizes itself into neighborhoods of meaning.

### What the neighborhoods look like

After training the 768 dimensional space has natural structure.
Some directions in this space correspond to real world concepts.

```
Direction 1 (dimensions 0 through 63):    Living vs non living
Direction 2 (dimensions 64 through 127):   Big vs small
Direction 3 (dimensions 128 through 191):  Positive vs negative
Direction 4 (dimensions 192 through 255):  Formal vs casual
... and so on through all 768 dimensions
```

These directions were never programmed. They emerged naturally
because the model found it useful to organize words this way. When
the model needs to know if something is alive or not it looks at
a specific set of dimensions in the embedding vector.

## A tiny code example

```python
import torch
import torch.nn as nn

# Create a tiny embedding table
vocab_size = 1000   # 1000 unique tokens
d_model = 4         # 4 dimensional vectors (small for the example)

embedding = nn.Embedding(vocab_size, d_model)

# Look up some token IDs
token_ids = torch.tensor([[12, 45, 678]])
vectors = embedding(token_ids)

print("Token IDs:", token_ids)
print("Shape of output:", vectors.shape)
print()
print("Vector for token 12:", vectors[0, 0].tolist())
print("Vector for token 45:", vectors[0, 1].tolist())
print("Vector for token 678:", vectors[0, 2].tolist())
print()
print("Each token ID became a", d_model, "dimensional vector.")
print("Right now the values are random. After training they will")
print("capture meaning. Words with similar meanings will have")
print("similar vectors.")
```

Running this code you will see something like:

```
Token IDs: tensor([[ 12,  45, 678]])
Shape of output: torch.Size([1, 3, 4])

Vector for token 12:  [0.031, -0.124, -0.847, 0.562]
Vector for token 45:  [-1.231, 0.789, 0.023, -0.441]
Vector for token 678: [0.892, -0.334, 0.671, -0.128]
```

These vectors are random right now. They have no meaning. After
training on billions of sentences token 12 and token 45 will be
near each other if they appear in similar contexts or far apart
if they do not.

## The size of the embedding table

The embedding table is often the largest component in the model
in terms of parameter count.

```
GPT-2 Small:  50257 words × 768 dims  = 38.6 million numbers
GPT-3:        50257 words × 12288 dims = 617 million numbers
```

This is why weight tying is important. The output layer also needs
a matrix of the same size to project back from hidden states to
vocabulary predictions. Instead of storing two giant matrices we
share one. The embedding table is used for both input and output.

## Embeddings for punctuation and special characters

Every token gets an embedding. Even punctuation and special
symbols. The period gets an embedding. The comma gets an
embedding. The end of text marker gets an embedding.

These embeddings are just as important as word embeddings. The
model learns that the embedding for a period is followed by the
embedding for a capitalized word. It learns that the embedding
for a question mark is followed by the embedding for an answer.
The structure of language lives in these small token embeddings
as much as it lives in the word embeddings.

## What you need to remember

An embedding is a list of numbers that represents a word's
meaning. Words with similar meanings have similar lists. Words
with different meanings have different lists.

The embedding table starts random. Training moves words around
based on the contexts they appear in. After enough training the
space organizes itself. King minus man plus woman equals queen.
This was not programmed. The model discovered it.

The embedding layer is just a lookup table. No math inside. Give
it a token ID and it returns a vector. That vector is the word's
coordinates in meaning space. Everything the model knows about a
word is packed into those 768 numbers.