# How a GPT Really Works: The Complete Story

This is the story of a language model. Not just one part. Not just one
step. The whole thing. From a text file on a hard drive to a machine
that can write poetry and answer questions and generate code. Every
single piece. Every single decision. Every single number.

We will build a GPT from scratch. We will train it. We will watch it
learn. We will run it. By the end you will understand every line of
code in every file. This story assumes you know Python. Nothing else.

---

## Part 1: What Are We Building

A GPT is a next word predictor. That is it. That is the whole thing.
You give it some words. It guesses what word comes next. Then it takes
that guess and guesses the next word. Then the next. Eventually it has
written a paragraph or a poem or a legal document or a recipe for
chocolate cake. But underneath it is always just guessing one word at
a time.

The model has about 150 million knobs. Each knob is a number. Training
means finding the right numbers for all 150 million knobs so that the
model's guesses match what a human would write. Once those numbers are
found the model can write text that is sometimes indistinguishable from
human writing.

How do we find those numbers? We show the model sentences from the
internet. Billions of sentences. For each sentence we hide the last word
and ask the model to guess it. When it guesses wrong we figure out which
knobs to turn and in which direction to make the guess better next time.
We repeat this billions of times. The knobs slowly converge to values
that capture the patterns of human language.

The architecture of the model determines which patterns it can capture.
A bigger model can capture more patterns. A better architecture can
capture more patterns with the same number of knobs. Our architecture
follows the same basic design as LLaMA 3 and Mistral and Qwen. It is
among the best publicly documented designs for language models as of 2025.

---

## Part 2: The Data

Before we can train a model we need text. Lots of text. Billions of
words. For this project we will use Wikipedia because it is freely
available and well written and covers almost every topic humans have
thought about.

Wikipedia can be downloaded as a single XML file or accessed through the
HuggingFace datasets library. The datasets library handles downloading
and caching so we do not have to manage the raw files ourselves.

```python
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
texts = [item["text"] for item in dataset if item["text"].strip()]
```

The WikiText-103 dataset contains about 28500 Wikipedia articles. They
have been lightly processed to remove markup and metadata. What remains
is clean flowing English text. Exactly what we need.

But we cannot feed raw text to a neural network. Neural networks eat
numbers. We need to convert our text into numbers first.

---

## Part 3: Tokenization . Text Becomes Numbers

The conversion from text to numbers is called tokenization. The
algorithm we use is Byte Pair Encoding. BPE for short. It was invented
in 1994 for data compression and repurposed for neural machine
translation in 2016. Language models adopted it soon after.

The idea is simple in concept. Start with every character as its own
token. Find the most common pair of adjacent tokens in the training
data. Merge them into a new token. Repeat until you have 50000 tokens.

Let us see how this works on a tiny example. Imagine our training data
contains only four words with spaces marked as underscores.

```
l o w _
l o w e r _
l o w e s t _
l o w e s t _
```

Each letter and underscore is a separate token. The vocabulary has
eight distinct tokens and the text is 24 tokens long. The alphabet is
small. The model would need many tokens to represent even a short
sentence. So we merge.

The most common pair is l and o. It appears four times. Once at the
start of every word. The pair o and w is tied at four so we take the
one we saw first. We create a new token lo. Now our text is shorter.

```
lo w _
lo w e r _
lo w e s t _
lo w e s t _
```

The vocabulary now has nine tokens and the text has shrunk to 20. We
keep merging. The next most common pair is lo and w. They appear
together four times. We create low. Now our text is even shorter.

```
low _
low e r _
low e s t _
low e s t _
```

We continue. After many rounds of merging on a larger corpus the
vocabulary contains useful pieces like low and er and est and the space
marker. Now the word lowest can be represented as low plus est. Two
tokens instead of six characters. A word that never appeared in the
training data such as slower can still be built from known pieces. s
plus low plus er. Compression and generalization in one step.
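The merge loop walked through above can be sketched in a few lines of plain Python. This is a minimal illustration on the toy corpus. It is not the tokenizer GPT-2 ships. The names `count_pairs`, `merge_pair` and `train_bpe` are made up for this sketch. Ties between equally common pairs go to whichever pair was seen first, which is an assumption of the sketch.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent token pairs across all words."""
    pairs = Counter()
    for word in words:
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += 1
    return pairs

def merge_pair(word, pair):
    """Replace every occurrence of `pair` in one word with the merged token."""
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged

def train_bpe(words, num_merges):
    """Run `num_merges` rounds of BPE. Return the merges in order and the final words."""
    words = [list(w) for w in words]  # start with one token per character
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # max() keeps the first pair seen among ties (assumption of this sketch)
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = [merge_pair(w, best) for w in words]
    return merges, words

corpus = ["low_", "lower_", "lowest_", "lowest_"]
merges, words = train_bpe(corpus, num_merges=2)
print(merges)    # [('l', 'o'), ('lo', 'w')]
print(words[0])  # ['low', '_']
```

Two merges reproduce the two steps shown above. Raising `num_merges` keeps growing the vocabulary one token per round.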

Real BPE tokenizers like GPT-2 use 50000 merges. They start from all
256 possible byte values as the base alphabet. This means they can
tokenize any text in any language that can be represented as bytes which
is all text. The 50000 merges capture the most common patterns across
billions of words. The result is a vocabulary that can represent common
words as single tokens and rare words as sequences of a few tokens and
completely unseen words as sequences of individual byte tokens.

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
text = "The cat sat on the mat."
tokens = tokenizer.encode(text)
print(tokens)  # [464, 3797, 3332, 319, 262, 2603, 13]
```

Seven tokens. Each token is an integer between 0 and 50256. These
integers are the only thing the model ever sees. The raw text is gone.
The model lives in a world of integers.

The tokenizer has one special token that deserves attention. Token
50256 is the end of text marker. It is placed between every document
in the training data. Without it the model would think that the last
sentence of one Wikipedia article flows naturally into the first
sentence of the next. The end of text token is the model's signal that
one thought has ended and a new unrelated thought has begun.

Every token has a unique ID. Token 464 is always The with a capital T.
Token 3797 is always cat with a leading space. Token 13 is always a
period. These mappings are fixed. They never change during training.
The tokenizer is not part of the neural network. It is a preprocessing
step with its own separate algorithm.

But these integer IDs are just labels. The number 3797 has no
mathematical relationship to the number 2603. The model cannot learn
meaningful patterns from these raw integers. We need to give each
token a richer representation. We need embeddings.

---

## Part 4: Embeddings . Numbers Become Meaning

An embedding is a vector of floating point numbers that captures the
meaning of a token. For our model each token gets a vector of 768
numbers. Token 3797 gets 768 numbers. Token 2603 gets 768 numbers.
Every one of the 50257 tokens gets its own row in a giant lookup table
of shape 50257 by 768.

```python
import torch

embedding_table = torch.nn.Embedding(50257, 768)
```

This table is just a matrix. Row 3797 is the embedding for cat. Row 2603
is the embedding for mat. When the model needs the vector for token 3797
it just reads row 3797 from the table. No multiplication. No activation
function. Just a memory lookup.

At initialization every row is filled with random numbers drawn from a
normal distribution with mean 0 and standard deviation 0.02. This means
most values are between -0.04 and 0.04. At this moment cat and dog have
no special relationship. They are just two random rows in a random
table. Every token is equally random.

Training changes this. Over billions of training steps the rows are
updated. Tokens that appear in similar contexts get pushed toward
similar values. Tokens that appear in different contexts get pushed
apart. After training the embedding for cat will be very close to the
embedding for dog. Both will be far from the embedding for democracy.
The space organizes itself into neighborhoods of meaning.

```
cat    = [ 0.34, -0.12,  0.78, -0.56, ...]  (768 numbers)
dog    = [ 0.31, -0.15,  0.81, -0.52, ...]  (very similar to cat)
democracy = [ 0.89, 0.67, -0.23,  0.91, ...]  (completely different)
```
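One common way to put a number on close and far is cosine similarity. The sketch below uses only the first four illustrative numbers of each vector shown above. They are made up values and not real trained embeddings. The helper name `cosine_similarity` is chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors. 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First four numbers of each illustrative vector (not real trained values)
cat = [0.34, -0.12, 0.78, -0.56]
dog = [0.31, -0.15, 0.81, -0.52]
democracy = [0.89, 0.67, -0.23, 0.91]

print(round(cosine_similarity(cat, dog), 3))        # close to 1
print(round(cosine_similarity(cat, democracy), 3))  # negative
```

A value near one means the two vectors point the same way. A value near zero or below means they do not.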

The embedding table is the largest single component in the model. It has
50257 rows times 768 columns equals about 38.6 million numbers. That is
roughly a quarter of all the parameters in our model. Those 38.6
million numbers encode everything the model knows about what words mean.

---

## Part 5: Positional Encoding . The Model Learns Order

The transformer reads all tokens at once. There is no left to right
processing. No step by step recurrence. Every token is processed
simultaneously. This is a strength because it is fast and parallelizable.
But it is also a problem because the model has no way to know which
token came first and which came last.

Consider two sentences. The dog bit the man. The man bit the dog. Same
words. Different order. Completely different meaning. If the model
treated every token independently it could not distinguish these
sentences. The meaning would be scrambled.

We need to stamp each token with its position. Tell the model where in
the sentence this token sits. The modern way to do this is called Rotary
Position Embeddings. RoPE for short. It was introduced in 2021 and
adopted by LLaMA in 2023. Most major open models since use it.

Instead of ADDING a position number to the embedding RoPE ROTATES the
query and key vectors by an angle that depends on the position. The
rotation preserves the vector's magnitude so it does not change the
meaning of the word. But the rotation changes the direction so the
attention dot product between two words becomes a function of their
distance apart.

```
Word at position 1: rotated by angle θ₁
Word at position 4: rotated by angle θ₄

Attention score between them:
Q₁ · K₄ = original_dot × cos(θ₁ - θ₄) + cross_term × sin(θ₁ - θ₄)

The result depends on (θ₁ - θ₄) which is a function of (4 - 1) = 3
steps apart. Not on positions 1 and 4 themselves. Only on the distance.
```

This is the key insight. RoPE makes attention depend on relative
position. Words three steps apart always get the same rotational
relationship regardless of whether they appear at positions 0 and 3 or
positions 497 and 500. The transformer cares about how far apart two
words are not about where they sit in absolute terms.

The angles are precomputed and stored. Each pair of dimensions rotates
at a different speed. The first pair of dimensions rotates fastest and
captures local word order. The last pair rotates slowest and captures
long range position. This multi scale approach means the model has both
fine grained local position information and coarse grained global
position information.

```python
import torch

head_dim = 64        # RoPE is applied per attention head (768 / 12 heads)
max_seq_len = 1024   # example value, not fixed by the text

# Precompute rotation angles for every position
dim_indices = torch.arange(0, head_dim, 2).float()
inv_freq = 1.0 / (10000.0 ** (dim_indices / head_dim))
positions = torch.arange(max_seq_len).float()
freqs = torch.outer(positions, inv_freq)  # [max_seq_len, head_dim / 2]
emb = freqs.repeat_interleave(2, dim=-1)  # [max_seq_len, head_dim]
cos_cached = emb.cos()
sin_cached = emb.sin()
```

During the forward pass we look up the precomputed cosine and sine
values for each position and apply the rotation. The rotation formula
for each pair of dimensions (x₀, x₁) at position p is:

```
x₀' = x₀ × cos(θ_p) - x₁ × sin(θ_p)
x₁' = x₀ × sin(θ_p) + x₁ × cos(θ_p)
```

This is a standard 2D rotation. Applied to every pair of dimensions in
the query and key vectors. The values are not rotated because position
information is only needed for deciding WHICH tokens to attend to not
for the content of the tokens themselves.
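The relative position claim can be checked numerically. The sketch below applies the pairwise rotation in plain Python to two made up four dimensional vectors. The helper names `rope_rotate` and `dot` are chosen for this example. A real implementation would use the cached cosine and sine tensors instead of recomputing angles.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair (x0, x1) of `vec` by a position dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position / (base ** (i / dim))  # earlier pairs rotate faster
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.append(x0 * cos_t - x1 * sin_t)
        out.append(x0 * sin_t + x1 * cos_t)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made up query and key vectors
q = [0.5, -1.0, 0.3, 0.8]
k = [1.2, 0.4, -0.7, 0.1]

near_start = dot(rope_rotate(q, 0), rope_rotate(k, 3))
far_along = dot(rope_rotate(q, 497), rope_rotate(k, 500))
print(abs(near_start - far_along) < 1e-9)  # True: only the offset of 3 matters
```

The same offset gives the same score wherever the pair sits in the sequence. The rotation also leaves the length of each vector unchanged.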

---

## Part 6: Attention . The Core Mechanism

Attention is the heart of the transformer. Everything else is support
infrastructure. The embedding layer feeds attention. The feed forward
network refines attention's output. The normalization layers keep
attention stable. But attention is where the model actually understands
relationships between words.

### The intuition

Imagine you are reading a long sentence. Some words are more important
than others for understanding what is happening. If the sentence is The
cat that had been sitting on the mat for three hours finally stretched
and yawned you need to connect stretched and yawned with cat across
at least eleven intervening words. Your brain does this automatically.
Attention does it mathematically.

For every word in the sentence the model creates three vectors. A Query
vector that asks what am I looking for. A Key vector that says what do
I have to offer. A Value vector that holds my actual content. Every word
compares its Query against the Key of every word including its own.
Words with high match scores get more attention. Their Values are
weighted more heavily in the output.

### The computation step by step

Let us trace through attention for a concrete sentence. Our sentence is
The cat sat on the mat. Seven tokens. We will look at one attention head
with a head dimension of 64.

The input to attention is a matrix of shape 7 by 768. Seven tokens each
represented by 768 numbers. We project this matrix into three new
matrices of shape 7 by 64. One for Query. One for Key. One for Value.

```
Q = input @ W_q  (7 × 768 @ 768 × 64 = 7 × 64)
K = input @ W_k  (7 × 768 @ 768 × 64 = 7 × 64)
V = input @ W_v  (7 × 768 @ 768 × 64 = 7 × 64)
```

The weight matrices W_q W_k and W_v are learned during training. They
are what make each attention head different. Different heads learn
different projections that capture different linguistic patterns.

Next we apply RoPE to the Query and Key vectors. This stamps each
query and key with its position information.

```
Q = RoPE(Q, seq_len=7)
K = RoPE(K, seq_len=7)
```

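Now we compute the attention scores. The score between token i and
token j is the dot product of Query i with Key j divided by the square
root of the head dimension. The division keeps the scores from growing
with the length of the vectors.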

```
scores = Q @ K^T / sqrt(64)
```

The result is a 7 by 7 matrix. Each row is a token acting as query. Each
column is a token acting as key. The value at row i column j is how much
token i wants to attend to token j.

```
scores matrix (before mask):

         The    cat    sat    on     the    mat    .
The      0.42   0.15   0.08  -0.03  -0.11   0.02  -0.18
cat      0.38   0.52   0.22   0.01  -0.05   0.09  -0.21
sat      0.21   0.78   0.15   0.31   0.22   0.28   0.05
on      -0.05   0.11   0.45   0.55   0.38   0.42   0.12
the     -0.09   0.03   0.21   0.48   0.61   0.35   0.08
mat     -0.12  -0.01   0.15   0.41   0.52   0.58   0.11
.       -0.22  -0.15  -0.08   0.12   0.18   0.22   0.48
```

Look at the row for sat (row index 2). It has a high score for cat
(0.78) and moderate scores for on (0.31) and mat (0.28). Sat wants to
pay attention to its subject and its prepositional phrase. It cares
less about itself (0.15) and the period (0.05). This pattern emerged
from the learned weights W_q and W_k and from the positional rotation.

Now we apply the causal mask. Tokens cannot see the future. The upper
right triangle of the matrix is set to negative infinity.

```
scores matrix (after mask):

         The    cat    sat    on     the    mat    .
The      0.42  -inf   -inf   -inf   -inf   -inf   -inf
cat      0.38   0.52  -inf   -inf   -inf   -inf   -inf
sat      0.21   0.78   0.15  -inf   -inf   -inf   -inf
on      -0.05   0.11   0.45   0.55  -inf   -inf   -inf
the     -0.09   0.03   0.21   0.48   0.61  -inf   -inf
mat     -0.12  -0.01   0.15   0.41   0.52   0.58  -inf
.       -0.22  -0.15  -0.08   0.12   0.18   0.22   0.48
```

After softmax negative infinity becomes zero. The scores become
attention weights that sum to one for each row.

```
attention weights matrix (after softmax, rounded to two decimals):

         The    cat    sat    on     the    mat    .
The      1.00   0.00   0.00   0.00   0.00   0.00   0.00
cat      0.47   0.53   0.00   0.00   0.00   0.00   0.00
sat      0.27   0.48   0.25   0.00   0.00   0.00   0.00
on       0.18   0.21   0.29   0.32   0.00   0.00   0.00
the      0.14   0.16   0.19   0.24   0.28   0.00   0.00
mat      0.11   0.12   0.14   0.19   0.21   0.22   0.00
.        0.10   0.11   0.12   0.15   0.15   0.16   0.21
```

Look at the row for on (row index 3). It attends 32 percent to itself
and 29 percent to sat and 21 percent to cat and 18 percent to The. A
fairly balanced distribution across the preceding tokens. The words the
and mat get zero because they lie in the future. The word on is a
preposition that connects to what came before it. It needs context from
every earlier word.

Look at the row for The (row index 0). It attends 100 percent to itself.
There is nothing before it. The word The has no context. It must rely
entirely on its own meaning. This is always true for the first token in
every sequence.

Finally we use these weights to mix the Value vectors.

```
output = attention_weights @ V

For token sat (row 2):
new_sat = 0.27 × V_The + 0.48 × V_cat + 0.25 × V_sat
```

The new vector for sat now contains information from The and cat
weighted by how much sat cares about them. The original meaning of sat
is still there via self attention (25 percent) but it has been enriched
with context from the subject of the sentence.
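The mask, softmax and mixing steps can be reproduced in plain Python from the score matrix in this section. This is a sketch for one head. The Value vectors are made up two dimensional placeholders and the function names `causal_softmax` and `mix_values` are chosen for this example.

```python
import math

tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
# Scaled scores before masking, copied from the score matrix in this section
scores = [
    [ 0.42,  0.15,  0.08, -0.03, -0.11,  0.02, -0.18],
    [ 0.38,  0.52,  0.22,  0.01, -0.05,  0.09, -0.21],
    [ 0.21,  0.78,  0.15,  0.31,  0.22,  0.28,  0.05],
    [-0.05,  0.11,  0.45,  0.55,  0.38,  0.42,  0.12],
    [-0.09,  0.03,  0.21,  0.48,  0.61,  0.35,  0.08],
    [-0.12, -0.01,  0.15,  0.41,  0.52,  0.58,  0.11],
    [-0.22, -0.15, -0.08,  0.12,  0.18,  0.22,  0.48],
]

def causal_softmax(scores):
    """Mask future positions, then apply softmax to each row."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                    # token i sees tokens 0..i
        peak = max(visible)                       # subtract the max for stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions get weight 0, which is what -inf becomes after softmax
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

def mix_values(weights, values):
    """output[i] = sum over j of weights[i][j] * values[j]"""
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in weights]

weights = causal_softmax(scores)
for token, row in zip(tokens, weights):
    print(f"{token:>4}", [round(w, 2) for w in row])

# Made up 2 dimensional Value vectors, only to show the final mixing step
values = [[float(i), 1.0] for i in range(len(tokens))]
output = mix_values(weights, values)
```

Every printed row sums to one and every entry to the right of the diagonal is zero.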

This entire computation happens 12 times in parallel for 12 heads. Each
head has its own W_q W_k and W_v matrices. Each head learns different
attention patterns. After all heads have computed their outputs we
concatenate them back together into a single 768 dimensional vector and
project through a final linear layer.

```
all_heads = torch.cat([head_0, head_1, ..., head_11], dim=-1)  # 12 × 64 = 768
output = all_heads @ W_o  # [7, 768] @ [768, 768] = [7, 768]
```

The output projection W_o mixes information between heads. Each head
operated independently. Now they share their discoveries. The grammar
head tells the pronoun resolution head what it found. The position head
tells the semantic head about word distances. The mixed output is richer
than any single head's contribution.

---

## Part 7: RMSNorm . Keeping Numbers Under Control

Before attention and before the feed forward network we normalize the
input. Normalization keeps the numbers at a consistent scale as they
flow through dozens of layers.

We use RMSNorm. It is simpler and faster than the older LayerNorm. It
computes the root mean square of a vector and divides every element by
it. The result always has RMS equal to 1.0.

```
rms = sqrt(mean(x²))
output = x / rms × weight
```

The weight is a learned parameter. One weight per dimension. It starts
at 1.0 and learns during training. It lets the model amplify important
dimensions and suppress unimportant ones while keeping the overall
magnitude stable.
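The two line formula can be sketched in plain Python. The small `eps` inside the square root is an assumption added here to avoid division by zero. It does not appear in the formula above. The function name `rms_norm` is chosen for this example.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then scale each dimension by weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

x = [3.0, -4.0, 12.0, 0.5]
weight = [1.0] * len(x)          # the learned scale starts at 1.0
y = rms_norm(x, weight)

rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print(round(rms_after, 4))       # 1.0
```

With the weight at its starting value of 1.0 the output has an RMS of one whatever the scale of the input.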

Without normalization the outputs of attention and feed forward layers
can drift to very different scales as they accumulate. After twelve
layers some values might be a thousand times larger than others. The
softmax in the next attention layer would collapse toward a one hot
vector. Gradients would vanish. Training would fail.

With normalization every layer gets clean well scaled inputs. The tower
of twelve blocks stays straight. The model trains smoothly.

---

## Part 8: SwiGLU . The Gated Feed Forward Network

After attention every token has mixed in information from the tokens it
can see. But the mixing was linear. Once the weights are computed
attention is just a weighted sum of Value vectors. Weighted sums are
not enough to capture the complexity of language. We need non linear
processing.

The feed forward network provides this non linearity. It processes each
token independently with the same learned weights. Each token gets the
same transformation applied to its unique vector.

Our feed forward network uses SwiGLU. SwiGLU is a gated activation. It
splits the computation into two paths. One path produces values. The
other path produces gates. The gates control how much of each value
passes through.

```
h = input @ W₁  (768 → 3072)  # value path
g = input @ W₂  (768 → 3072)  # gate path
output = (SiLU(h) × g) @ W₃  (3072 → 768)  # combine and project
```

The expansion from 768 to 3072 gives the network room to transform
information. In the wider middle layer the network can represent more
complex patterns. The contraction back to 768 forces it to compress
those patterns into a dense representation.

The SiLU activation on the value path provides smooth non linearity.
Unlike ReLU which has a sharp corner at zero SiLU is smooth everywhere.
This makes gradients flow better during training. The gate path has no
activation. It can output any real number. A gate of zero blocks the
information. A gate of one passes it through unchanged. A gate of two
amplifies it. The model learns which inputs should be amplified and
which should be suppressed. The product is symmetric so either path can
be read as the gate. The Hugging Face LLaMA implementation names the
SiLU path gate_proj and the linear path up_proj.

The gate learns context dependent filtering. When the token is a verb
the gate might amplify dimensions related to action and suppress
dimensions related to objects. When the token is a noun it might do the
opposite. The same network weights apply to every token but the behavior
differs because each token's vector leads to different gate values.

SwiGLU has three weight matrices instead of the two that a standard
feed forward network would have. The extra matrix is for the gate. With
the hidden width kept at 3072 this adds about 28 million parameters to
our model compared to a standard FFN. That is one extra 768 by 3072
matrix in each of the twelve blocks. Published comparisons have found
that gated variants like SwiGLU outperform ReLU and GELU feed forward
layers at scale.
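The three matrix computation can be sketched with tiny made up weights. This version maps 2 to 3 to 2 dimensions instead of 768 to 3072 to 768 and uses plain Python lists. The names `silu`, `matvec` and `swiglu_ffn` are chosen for this example.

```python
import math

def silu(x):
    """SiLU (also called swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(vec, matrix):
    """Row vector times matrix: vec [n] @ matrix [n][m] -> [m]."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

def swiglu_ffn(x, w1, w2, w3):
    h = matvec(x, w1)                              # path that goes through SiLU
    g = matvec(x, w2)                              # path with no activation
    hidden = [silu(a) * b for a, b in zip(h, g)]   # elementwise product
    return matvec(hidden, w3)                      # project back down

# Made up weights for illustration only
w1 = [[0.5, -1.0, 0.2], [0.3, 0.8, -0.6]]
w2 = [[1.0, 0.0, -0.5], [0.2, 0.7, 0.9]]
w3 = [[0.4, -0.2], [0.1, 0.6], [-0.3, 0.5]]

out = swiglu_ffn([1.0, 2.0], w1, w2, w3)
print(out)
```

The same three matrices are applied to every token. Only the input vector changes.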

---

## Part 9: The Residual Connection . The Gradient Highway

Every sublayer has a residual connection. The attention output is added
to the attention input. The feed forward output is added to the feed
forward input.

```
x = x + attention(norm(x))
x = x + ffn(norm(x))
```

These plus signs are the most important operators in the entire model.
Without them deep transformers cannot be trained. The gradients would
vanish. The early layers would never learn.

Here is why. When the model makes a prediction and computes the loss it
sends a gradient backward through the network. This gradient tells each
weight how to change to reduce the loss. The gradient flows backward
through each layer in reverse order. At each layer it is multiplied by
the derivative of that layer's function. If the derivative is smaller
than one the gradient shrinks. After propagating backward through twelve
layers the gradient at the first layer is the product of eleven numbers
that are each less than one.

```
gradient_at_layer_1 = gradient_at_layer_12 × d₁ × d₂ × ... × d₁₁

If each derivative is 0.5:
gradient_at_layer_1 = gradient_at_layer_12 × 0.5¹¹
                    = gradient_at_layer_12 × 0.0005
```

The gradient at layer one is two thousand times smaller than the
gradient at layer twelve. The first layer receives almost no learning
signal. Its weights stay random. The model cannot train.

Residual connections fix this by providing a second path. The gradient
can flow backward through the sublayer like before. Or it can bypass the
sublayer entirely and flow straight to the input. The bypass path has a
derivative of exactly 1.0. Always. The gradient does not shrink.

```
With residual: output = input + sublayer(norm(input))
Derivative:    d(output)/d(input) = 1 + d(sublayer)/d(input)
```

The total derivative is 1 plus something. Even if the something is small
the 1 keeps a direct path open so the gradient does not shrink toward
zero at every layer. After twelve layers the gradient at layer one can
still be comparable in size to the gradient at layer twelve. Every layer
can learn.
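The arithmetic is quick to check. The sketch below treats each layer's derivative as a single number, which is a simplification. In a real model each one is a matrix. The sublayer derivative of 0.01 is a made up value.

```python
layers_to_cross = 11

# Without residuals: each layer multiplies the gradient by its local derivative
d = 0.5
plain = d ** layers_to_cross
print(plain)  # 0.00048828125

# With residuals: each layer multiplies by (1 + d_sublayer) instead
d_sublayer = 0.01  # made up small sublayer derivative
residual = (1 + d_sublayer) ** layers_to_cross
print(round(residual, 3))  # 1.116
```

The first product collapses toward zero. The second stays near one.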

This is why we can stack twelve blocks. Or twenty four. Or ninety six.
The gradient highway stays open regardless of depth. The only limit is
computational cost not trainability.

---

## Part 10: The Full Model . Putting It All Together

Let us assemble every piece into the complete model.

```python
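import torch.nn as nn

# TransformerBlock and RMSNorm are the modules described in Parts 6-9 and Part 7
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()  # required before registering submodules
        self.token_embedding = nn.Embedding(config.vocab_size, config.d_model)  # Part 4
        self.layers = nn.ModuleList([
            TransformerBlock(config.d_model, config.num_heads)  # Parts 6-9
            for _ in range(config.num_layers)
        ])
        self.final_norm = RMSNorm(config.d_model)  # Part 7
        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)  # Output projection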

    def forward(self, input_ids):
        # Part 4: Embed tokens
        x = self.token_embedding(input_ids)  # [batch, seq, 768]

        # Parts 6-9: Process through transformer blocks
        for layer in self.layers:
            x = layer(x)  # Each block contains attention + FFN + residuals

        # Part 7: Final normalization
        x = self.final_norm(x)

        # Output: Project to vocabulary
        logits = self.lm_head(x)  # [batch, seq, 50257]
        return logits
```

That is the entire top level model. Fewer than thirty lines of code.
Every component we discussed is reachable from those lines. The
embedding table from Part 4. The stacked transformer blocks from Parts 6
through 9. The final normalization from Part 7. The output projection
that converts hidden states back to vocabulary predictions.

The model takes a batch of token sequences as input. For each position
in each sequence it produces 50257 scores. One score for each possible
next token. The highest scoring token is the model's prediction for what
word comes next.
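With every piece in place the parameter count can be tallied. The sketch below makes assumptions the text does not state outright: twelve blocks, a feed forward width of 3072, no bias terms, and an output projection that shares its matrix with the embedding table. Under those assumptions the total lands near the 150 million figure quoted earlier.

```python
vocab_size, d_model, num_layers, ffn_hidden = 50257, 768, 12, 3072

embedding = vocab_size * d_model       # token embedding table
attention = 4 * d_model * d_model      # W_q, W_k, W_v, W_o per block (all heads)
swiglu = 3 * d_model * ffn_hidden      # W1, W2, W3 per block
norms = 2 * d_model                    # two RMSNorm weight vectors per block
per_block = attention + swiglu + norms

# Assumption: lm_head reuses the embedding matrix, so it adds no parameters
total = embedding + num_layers * per_block + d_model  # d_model = final RMSNorm
print(f"{total:,}")                    # 151,862,784
print(round(embedding / total, 3))     # 0.254
```

If the output projection had its own matrix the total would rise by another 38.6 million.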

---

## Part 11: The Output . From Vectors to Words

The final layer of the model projects from 768 dimensions to 50257
dimensions. This is a simple linear transformation. Multiply by a weight
matrix of shape 768 by 50257.

```python
logits = x @ W_lm_head  # [batch, seq, 768] @ [768, 50257] = [batch, seq, 50257]
```

These 50257 numbers are called logits. They are unnormalized scores.
Higher means the model thinks that token is more likely. They are not
probabilities yet because they do not sum to one and some may be
negative.

To convert logits to probabilities we apply softmax.

```python
probs = torch.softmax(logits, dim=-1)  # Each row now sums to 1.0
```

Each row of the probability matrix sums to one. Row i column j is the
model's estimated probability that token j comes next given the first
i plus 1 tokens of the input.

The model does not output just one token. It outputs a probability
distribution over all 50257 tokens. During training we compare this
distribution to the actual next token. During generation we sample from
this distribution to pick the next word.
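Softmax and the two ways of picking a token can be sketched in plain Python. The logits below are made up scores for a four token vocabulary. The fixed random seed is only there to make the run repeatable.

```python
import math
import random

def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to one."""
    peak = max(logits)                           # subtract the max for stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1, -1.5]   # made up scores, one per token
probs = softmax(logits)

# Greedy decoding: always take the most likely token
greedy = max(range(len(probs)), key=probs.__getitem__)

# Sampling: draw a token in proportion to its probability
rng = random.Random(0)
sampled = rng.choices(range(len(probs)), weights=probs, k=1)[0]

print([round(p, 3) for p in probs], greedy, sampled)
```

Greedy decoding always returns the same token for the same logits. Sampling can return any token with non zero probability.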

---

## Part 12: The Loss . Measuring Wrongness

Training needs a number that tells us how good the model's predictions
are. Lower is better. The number is called the loss.

We use cross entropy loss. It measures the difference between the
model's predicted probabilities and the actual next tokens.

For a single prediction where the true next token is j:

```
loss = -log(probs[j])
```

If the model assigns probability 0.9 to the correct token the loss is
negative log of 0.9 which is 0.105. Good. The model was confident and
right.

If the model assigns probability 0.1 to the correct token the loss is
negative log of 0.1 which is 2.303. Bad. The model was confident about
the wrong things.

If the model assigns probability 0.01 to the correct token the loss is
negative log of 0.01 which is 4.605. Terrible. The model barely
considered the correct answer.

The loss is always positive. It approaches zero as the model becomes
perfect. It approaches infinity as the model becomes completely wrong.
A random model that assigns equal probability to all 50257 tokens would
have a loss of negative log of one over 50257 which is about 10.8. This
is the baseline. Any loss above 10.8 means the model is worse than
random. Any loss below 10.8 means the model has learned something.
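The numbers above are easy to check. A minimal sketch in plain Python, where `cross_entropy` is a helper written for this example and the two token distributions are made up:

```python
import math

def cross_entropy(probs, target):
    """Loss for one prediction: negative log of the probability of the true token."""
    return -math.log(probs[target])

# Toy two token distributions. Index 0 is the correct token.
for p in (0.9, 0.1, 0.01):
    print(p, round(cross_entropy([p, 1 - p], 0), 3))   # 0.105, 2.303, 4.605

# A model that spreads its probability evenly over the whole vocabulary
vocab_size = 50257
uniform = [1 / vocab_size] * vocab_size
uniform_loss = cross_entropy(uniform, 0)
print(round(uniform_loss, 3))                          # 10.825
```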

```python
import torch.nn.functional as F

def compute_loss(logits, targets):
    # logits:   [batch, seq, 50257]
    # targets:  [batch, seq]  (shifted by 1 from input)
    logits_flat = logits.reshape(-1, logits.size(-1))  # [batch * seq, 50257]
    targets_flat = targets.reshape(-1)                 # [batch * seq]
    return F.cross_entropy(logits_flat, targets_flat)  # mean over all positions
```

We compute the loss over all positions in all sequences in the batch.
F.cross_entropy averages over every prediction in the batch to give a
single number that measures the model's performance. Every training
step we try to make this number smaller.

---

## Part 13: Backpropagation . Figuring Out What to Change

We have a loss. The loss tells us how wrong the model was. But it does
not tell us which of the 150 million weights to change or in which
direction. Backpropagation answers this question.

Backpropagation applies the chain rule from calculus. For every weight
in the model it computes the partial derivative of the loss with respect
to that weight. This derivative tells us: if I increase this weight by
a tiny amount how much will the loss change.

```python
loss.backward()  # PyTorch does all the calculus automatically
```

After this call every weight in the model has a .grad attribute. The
gradient is a tensor of the same shape as the weight. Each element
tells us how fast the loss rises when that weight is increased. To
reduce the loss we move the weight in the opposite direction.

```
If weight[i,j].grad = 0.003:
  Increasing weight[i,j] makes the loss go up.
  We should decrease it.

If weight[i,j].grad = -0.005:
  Increasing weight[i,j] makes the loss go down.
  We should increase it.

If weight[i,j].grad = 0.000:
  Changing this weight does not affect the loss.
  We can leave it alone or change it without consequence.
```

The gradients flow backward from the loss through the output projection
through the final normalization through each transformer block in
reverse order and finally into the embedding table. The integer token
IDs themselves receive no gradient. At each step the chain rule
multiplies local derivatives. The residual connections ensure that
gradients survive the journey.
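What a gradient means can be seen on a model with a single weight. The sketch below compares the chain rule derivative with a finite difference estimate and then takes one step against the gradient. The model, the numbers and the step size of 0.1 are all made up for illustration.

```python
def loss_fn(w, x=2.0, y=3.0):
    """Squared error of a one weight model: prediction = w * x."""
    return (w * x - y) ** 2

def analytic_grad(w, x=2.0, y=3.0):
    """Chain rule: d/dw (w*x - y)^2 = 2 * (w*x - y) * x."""
    return 2.0 * (w * x - y) * x

w = 2.0
h = 1e-6
numeric = (loss_fn(w + h) - loss_fn(w - h)) / (2 * h)  # nudge w and watch the loss
grad = analytic_grad(w)
print(grad)                     # 4.0

# Positive gradient: increasing w raises the loss, so step the other way
w_new = w - 0.1 * grad
print(loss_fn(w), loss_fn(w_new))
```

The two estimates agree and the step against the gradient lowers the loss. Backpropagation does the same calculation for every weight at once.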

---

## Part 14: Gradient Clipping . Preventing Wild Jumps

Sometimes a batch of text produces very large gradients. A rare word
pattern or an unusual sentence structure sends a shockwave through the
gradients. If we applied these large gradients directly the model's
weights would jump to a completely different configuration. Training
would be destroyed.

Gradient clipping prevents this. After the backward pass we check the
total magnitude of all gradients. If it exceeds a threshold we shrink
all gradients proportionally to fit under the threshold.

```python
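import torch

# gradients: list of .grad tensors, one per parameter
total_norm = torch.sqrt(sum(g.pow(2).sum() for g in gradients))  # global L2 norm
if total_norm > 1.0:
    scale = 1.0 / total_norm
    for g in gradients:
        g.mul_(scale)  # shrink in place, direction unchanged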
```

The direction of the update is preserved. Only the step size is limited.
The model takes small safe steps instead of wild leaps. The threshold
of 1.0 is standard for transformer training. It was found empirically.
It catches dangerous spikes without interfering with normal updates.
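A two number example makes the effect concrete. This sketch clips a plain Python list instead of real tensors and the name `clip_by_global_norm` is chosen for this example. In PyTorch the built in `torch.nn.utils.clip_grad_norm_` performs the same kind of global norm clipping on a model's parameters.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients down together if their combined L2 norm is too large."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm_before = clip_by_global_norm([3.0, -4.0])
print(norm_before)   # 5.0
print(clipped)       # approximately [0.6, -0.8]

small, _ = clip_by_global_norm([0.3, -0.4])
print(small)         # unchanged, the norm was already under the threshold
```

The ratio between the two gradients stays the same. Only their combined length changes.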

---

## Part 15: AdamW . Updating the Weights

We have gradients for every weight. Now we need to apply them. The
simplest approach is to move each weight a tiny bit in the opposite
direction of its gradient.

```
weight = weight - learning_rate × gradient
```

This is stochastic gradient descent. It works but it is slow and
unstable. The learning rate is the same for every weight regardless of
how much each weight needs to change. Noisy gradients cause zigzagging.
Large weights receive no regularization.

AdamW improves on all three fronts. It maintains running averages of
past gradients and their magnitudes. It uses these averages to adjust
the step size for each weight independently. It applies weight decay
separately from the gradient update.

```python
# AdamW for a single weight (beta1, beta2, eps are β₁, β₂, ε in the text)
momentum = beta1 * momentum + (1 - beta1) * gradient       # Running average of gradients
velocity = beta2 * velocity + (1 - beta2) * gradient ** 2  # Running average of squared gradients

# Bias correction for early steps
mom_corrected = momentum / (1 - beta1 ** step)
vel_corrected = velocity / (1 - beta2 ** step)

# Decoupled weight decay
weight = weight * (1 - lr * weight_decay)

# Gradient update
weight = weight - lr * mom_corrected / (vel_corrected ** 0.5 + eps)
```

The momentum term acts like inertia. It smooths out noise by
maintaining a running average of past gradients. If the gradient
points in the same direction for many steps momentum builds up and the
step size increases. If the gradient oscillates momentum cancels out
and the step size decreases.

The velocity term adjusts per weight learning rates. Weights whose
gradients have been large get smaller steps. Weights whose gradients
have been small get larger steps. This adaptive behavior means we do
not need to tune the learning rate for every weight individually.

The weight decay term pushes all weights toward zero by a tiny fraction
each step. This prevents weights from growing without bound. Large
weights are a sign of overfitting. The model has become too confident
about a few patterns and ignores everything else. Weight decay forces it
to stay humble.

The epsilon term prevents division by zero. It is tiny and rarely needs
tuning.

AdamW is the standard optimizer for language model training. GPT-3
trained with it. LLaMA trained with it. Every model in this guide
trains with it. The specific hyperparameters, β₁ of 0.9, β₂ of 0.95 and
weight decay of 0.1, are the LLaMA defaults. They have been validated
on models from one billion to seventy billion parameters.

---

## Part 16: Cosine Warmup . The Learning Rate Schedule

The learning rate is not constant throughout training. It follows a
schedule that warms up then decays.

At the very start of training the model's weights are random. The
gradients are large and noisy. A large learning rate would send the
model flying off in random directions. We start with a learning rate
of zero and linearly increase it to the maximum over several thousand
steps. This is the warmup phase.

```python
if step < warmup_steps:
    lr = max_lr * step / warmup_steps
```

Once the model is stable we can train at full speed. But as training
progresses and the model gets closer to a good solution we need to be
more careful. Large steps would overshoot the minimum. We gradually
reduce the learning rate following a cosine curve.

```python
import math

progress = (step - warmup_steps) / (total_steps - warmup_steps)
lr = min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))
```

The cosine curve starts falling slowly then faster in the middle then
slowly again at the end. This smooth decay is gentler than step decay
which drops the learning rate abruptly at fixed intervals. Abrupt drops
can disturb the model. Cosine decay is continuous.

At the very end of training the learning rate reaches a small minimum.
The model takes tiny steps that refine its weights with precision. This
is the refinement phase.

All three phases together make training both stable at the start and
precise at the end. Most modern language models use this schedule or a
close variant.
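
The two snippets above can be combined into one function. This is a sketch. The name `lr_at` and the default values for `max_lr`, `min_lr`, `warmup_steps` and `total_steps` are assumptions chosen for illustration, not settings taken from this guide.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Warmup: climb linearly from zero to max_lr
        return max_lr * step / warmup_steps
    # Decay: progress runs from 0 to 1 and is clamped after total_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

for s in (0, 1000, 2000, 51_000, 100_000):
    print(s, lr_at(s))
```

The printed values start at zero, reach `max_lr` at the end of warmup, pass the midpoint between `max_lr` and `min_lr` halfway through the decay and finish at `min_lr`.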

---

## Part 17: Mixed Precision . Faster Training

The model's weights are stored as 32 bit floating point numbers. This
is the default in PyTorch. Good precision and good range.

But most operations inside the forward pass do not need 32 bits of
precision. The matrix multiplications in attention and the feed forward
network work almost as well with 16 bits. Using 16 bits instead of 32
halves the memory needed for activations and can nearly double speed on
modern GPUs.

We use a format called bfloat16. It has the same range as float32 but
less precision. Both formats use 8 exponent bits so the maximum
representable number is nearly the same. So bfloat16 does not overflow
where float32 would not, even during the largest matrix
multiplications. The only difference is that bfloat16 can only represent
about two decimal digits of precision instead of seven.

This tradeoff is perfect for neural networks. We need the range to
prevent overflow during intermediate computations. But we do not need
seven digits of precision for every activation. Two digits is enough
for the model to learn effectively.

```python
with torch.amp.autocast('cuda', dtype=torch.bfloat16):
    # Every operation here uses bfloat16 where safe
    logits = model(input_ids)
    loss = compute_loss(logits, targets)
```

The master weights are always stored in float32. Only the forward and
backward passes use bfloat16. The weight updates are applied in float32
to preserve precision over thousands of training steps.
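
To make the tradeoff concrete, the sketch below emulates bfloat16 in plain Python. It keeps the top 16 bits of a float32 and rounds to nearest even. The helper name `to_bfloat16` is an assumption for illustration. In real training PyTorch does this conversion for us. The sketch ignores special values like NaN.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a number to the nearest bfloat16 value (emulated, hypothetical helper)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)           # round to nearest even
    bits = (bits + rounding_bias) & 0xFFFF0000            # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(to_bfloat16(3.14159265))   # 3.140625: only the first digits survive
print(to_bfloat16(1e38))         # still finite: the range matches float32
print(to_bfloat16(1.0 + 0.001))  # 1.0: a small update is rounded away
```

The last line shows why the weight updates are applied in float32. A tiny step added to a bfloat16 weight can round straight back to the old value, so the weight would never move.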

Some operations stay in float32 because they need more precision.
Normalization layers need full precision to keep activations properly
scaled. The softmax in attention needs full precision for numerical
stability. Autocast handles these exceptions automatically. We do not
need to specify which operations to convert.

---

## Part 18: The Training Loop . Putting It All Together

We have every piece. The model. The data. The tokenizer. The optimizer.
The scheduler. The loss function. Now we assemble them into a training
loop.

```python
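import torch
import torch.nn.functional as F
from torch.nn.utils import clip_grad_norm_

# model, dataloader, optimizer, scheduler, max_steps and use_amp
# are assumed to be defined already.
data_iter = iter(dataloader)

for step in range(max_steps):
    # 1. Get a batch of text
    input_ids, target_ids = next(data_iter)

    # 2. Forward pass
    with torch.amp.autocast('cuda', dtype=torch.bfloat16, enabled=use_amp):
        logits = model(input_ids)  # (batch, seq_len, vocab_size)
        # cross_entropy expects (N, vocab_size) logits and (N,) targets
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               target_ids.reshape(-1))

    # 3. Backward pass
    loss.backward()

    # 4. Clip gradients
    clip_grad_norm_(model.parameters(), max_norm=1.0)

    # 5. Update weights
    optimizer.step()
    optimizer.zero_grad()

    # 6. Update learning rate
    scheduler.step()

    # 7. Log progress
    if step % 100 == 0:
        print(f"Step {step}: loss = {loss.item():.4f}")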
```

Seven steps. Repeated thousands or millions of times. On average each
repetition makes the loss slightly smaller. The model gets slightly
better. After enough repetitions the model can generate coherent text.

The first few hundred steps are chaotic. The loss bounces around. The
gradients are large. The model is searching. Around step one thousand
the loss starts a steady decline. The model has found a good direction.
From then on progress is slow but consistent. Each step shaves a tiny
fraction off the loss. After fifty thousand steps the loss has dropped
from around 10.8 to somewhere between 2 and 3. The model can write
sentences that are sometimes grammatical and sometimes nonsensical. It
knows that periods end sentences and that capital letters start them.
It knows that "the" is often followed by a noun. It knows that "cat" and
"dog" can both sit and run and sleep.
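
The starting loss of around 10.8 is not arbitrary. A model that knows nothing spreads its probability evenly over the whole vocabulary, and the cross entropy of a uniform guess is the natural log of the vocabulary size. A quick check, assuming the 50,257 token GPT-2 vocabulary:

```python
import math

vocab_size = 50_257  # GPT-2 vocabulary size (assumed to be the tokenizer used here)

# Cross entropy of a uniform guess: -log(1 / vocab_size) = log(vocab_size)
initial_loss = math.log(vocab_size)
print(f"{initial_loss:.2f}")

# A loss of 2.5 means the correct token gets about e^-2.5 probability on average
print(f"{math.exp(-2.5):.3f}")
```

So a drop from 10.8 to between 2 and 3 means the typical probability given to the correct token rises from about one in fifty thousand to somewhere between one in twenty and one in seven.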

After five hundred thousand steps the model writes paragraphs that are
mostly coherent. It still makes mistakes. It invents facts. It repeats
itself. But it has captured a remarkable amount of the structure of
English. All from predicting the next word billions of times.

---

## Part 19: Text Generation . The Model Speaks

Once the model is trained we want it to write something. We give it a
starting phrase called a prompt. The model reads the prompt and predicts
the first word after it. Then it takes the prompt plus that predicted
word and predicts the second word. It repeats until it has generated
enough text or until it predicts an end of text token.

```python
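import torch

# model and tokenizer are assumed to be loaded already.
temperature, top_k = 0.8, 50  # example values

prompt = "The cat sat on the"
input_ids = torch.tensor([tokenizer.encode(prompt)])  # [[464, 3797, 3332, 319, 262]]

for _ in range(50):
    with torch.no_grad():
        logits = model(input_ids)      # Predict next token
    logits = logits[:, -1, :]          # Only the last position

    # Keep the top_k logits, apply temperature, then sample one of them
    top_logits, top_ids = torch.topk(logits / temperature, top_k)
    probs = torch.softmax(top_logits, dim=-1)
    next_token = top_ids.gather(-1, torch.multinomial(probs, num_samples=1))

    input_ids = torch.cat([input_ids, next_token], dim=1)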
```

The sampling parameters control how we pick the next token. The simplest
strategy is greedy decoding. We always pick the single most likely
token. The output is deterministic and often repetitive. The same prompt
always produces the same completion. The model loops on common phrases.

Temperature controls randomness. It divides the logits by a number
before softmax. A temperature of 1 leaves the distribution unchanged.
Low temperature makes the distribution sharper. The top token gets even
more probability. The output is focused and predictable. High
temperature flattens the distribution. Less likely tokens get more
chance. The output is creative and unpredictable.

```
Temperature 0.3: "The cat sat on the windowsill gazing at the birds outside."
Temperature 0.8: "The cat sat on the edge of the couch watching me with sleepy eyes."
Temperature 1.5: "The cat sat on the piano keys and composed a midnight melody."
```
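
The effect on the probabilities can be checked with a small sketch. The three toy logits and the helper name `softmax_with_temperature` are assumptions for illustration. A real model produces one logit per vocabulary token.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    biggest = max(scaled)                          # subtract the max for stability
    exps = [math.exp(x - biggest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # toy values

sharp = softmax_with_temperature(logits, 0.3)
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 1.5)

for name, probs in (("0.3", sharp), ("1.0", neutral), ("1.5", flat)):
    print(name, [round(p, 3) for p in probs])
```

At temperature 0.3 the top token takes almost all of the probability. At 1.5 the three tokens are much closer together.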

Top-k limits the choices to the k most likely tokens. Everything else
gets zero probability. This prevents the model from ever picking a
completely nonsensical token. A value of 50 is common. It eliminates the
bottom 50207 tokens while keeping enough variety for interesting output.

Top-p is an adaptive version of top-k. Instead of always keeping k
tokens it keeps the smallest set of tokens whose cumulative probability
exceeds p. If the model is very confident it might keep only three
tokens. If the model is uncertain it might keep five hundred. This
adapts to the model's confidence at each step.
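
Both filters can be sketched in a few lines of plain Python. The function names and the toy probability lists are assumptions for illustration. This version of top-p stops as soon as the cumulative probability reaches p.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize. Returns {token_id: prob}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

confident = [0.90, 0.05, 0.03, 0.02]        # toy distribution, one clear winner
uncertain = [0.30, 0.25, 0.20, 0.15, 0.10]  # toy distribution, no clear winner

print(len(top_p_filter(confident, 0.85)))   # few tokens survive
print(len(top_p_filter(uncertain, 0.85)))   # more tokens survive
print(top_k_filter(uncertain, 2))           # always exactly k tokens
```

With the same p the confident distribution keeps a single token while the uncertain one keeps four. Top-k keeps two tokens in both cases.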

Together these three parameters give us fine control over the model's
output. They are the reason the same model can write both technical
documentation and poetry. The model provides the probabilities. The
parameters control how we sample from them.

---

## Part 20: KV Cache . Making Generation Fast

The naive generation loop is slow. Every time we append a new token we
recompute the entire sequence from scratch. The first token has already
been processed 500 times by the time we add token 501. Most of the
computation is redundant. The Key and Value vectors for the first 500
tokens do not change when we add token 501.

The KV cache eliminates this redundancy. We store the Key and Value
vectors for every token we have already processed. When a new token
arrives we compute its Key and Value and append them to the cache. We
do not recompute anything for the old tokens.

```
Without cache:  Step 1 computes K and V for 1 token.
                Step 2 computes K and V for 2 tokens.
                Step 3 computes K and V for 3 tokens.
                Total work: 1 + 2 + 3 + ... + N ≈ N²/2

With cache:     Step 1 computes K and V for 1 token.
                Step 2 computes K and V for 1 new token. Reuses old.
                Step 3 computes K and V for 1 new token. Reuses old.
                Total work: N
```
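
The totals in the table can be counted directly. This sketch only counts how many tokens need their Key and Value computed. It does not measure real time, and the function name is an assumption for illustration.

```python
def kv_computations(num_tokens, use_cache):
    """Count per-token Key/Value computations over a whole generation."""
    total = 0
    for step in range(1, num_tokens + 1):
        # With a cache only the new token is computed.
        # Without it all `step` tokens are recomputed.
        total += 1 if use_cache else step
    return total

naive = kv_computations(1000, use_cache=False)
cached = kv_computations(1000, use_cache=True)
print(naive, cached, naive / cached)
```

For a thousand tokens the naive loop does 500,500 computations and the cached loop does 1,000. That is a ratio of about five hundred.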

For a thousand token generation the cache cuts the Key and Value
computation by a factor of about five hundred. The end to end speedup is
smaller because each new token still attends over every cached token.
The memory cost is manageable for small models. For GPT-2 Small the
cache for a thousand tokens is about 37 megabytes at 16 bits per value.
For the 175 billion parameter GPT-3 it would be about 4.7 gigabytes. For
very large models at very long context lengths the cache can become the
dominant memory consumer.
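
The memory figures follow from a simple product. The sketch below assumes standard multi-head attention, where every layer stores one Key vector and one Value vector of width `d_model` per token, and 16 bit values. The function name is an assumption for illustration. Models that share Key and Value heads across query heads need less.

```python
def kv_cache_bytes(num_layers, d_model, num_tokens, bytes_per_value=2):
    """Size of the KV cache in bytes for one sequence (hypothetical helper)."""
    # 2 = one Key vector plus one Value vector per layer per token
    return 2 * num_layers * d_model * num_tokens * bytes_per_value

gpt2_small = kv_cache_bytes(num_layers=12, d_model=768, num_tokens=1000)
gpt3_175b = kv_cache_bytes(num_layers=96, d_model=12288, num_tokens=1000)

print(f"GPT-2 Small: {gpt2_small / 1e6:.1f} MB")
print(f"GPT-3 175B:  {gpt3_175b / 1e9:.2f} GB")
```

The cache grows linearly with the number of tokens, so doubling the context length doubles both numbers.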

---

## Part 21: What the Model Actually Learned

After training on billions of words the model has learned patterns that
are invisible to the untrained eye. It has not learned facts in the way
a database stores facts. It has learned statistical regularities. The
word sequence "the cat sat on the" is almost always followed by "mat" or
"floor" or "chair" or "bed". The sequence "the capital of France is" is
almost always followed by "Paris". The model does not know what France
or Paris
or capital mean. It only knows the probability distribution over next
words given all previous words.

The embeddings have organized themselves into a space with structure.
The vector for king minus the vector for man plus the vector for woman
is very close to the vector for queen. This was not programmed. It
emerged from training data where king and queen appeared in similar
contexts but with different gendered pronouns.

The attention heads have specialized. Some heads consistently attend to
the subject of the current verb. Others attend to recent nouns mentioned
in the sentence. Others attend to punctuation to understand sentence
boundaries. These specializations were not designed. They emerged from
the training objective of predicting the next word.

The feed forward networks have become pattern recognizers. One part of
the network might activate strongly when it sees a list of items because
commas between items predict more items. Another part might activate for
dates because the word "in" followed by a year predicts a specific
temporal pattern. These patterns are distributed across thousands of
neurons in ways that are difficult to interpret but effective for
prediction.

---

## Part 22: Why This Matters

A machine that can predict the next word with high accuracy is a machine
that has implicitly learned the rules of language. Grammar. Syntax.
Semantics. Discourse structure. World knowledge. All of it is necessary
to make accurate predictions. The model must know that verbs agree with
their subjects in number. It must know that Paris is in France and that
France is in Europe. It must know that a sentence that starts with
although expects a contrasting clause. It must know that a recipe for
cake includes flour and sugar and eggs not motor oil and concrete.

The model acquires all this knowledge through a single task: predict the
next token. It is a simple task with profound implications. A system
that can predict what humans will write next is a system that has
compressed a significant fraction of human knowledge into a set of
matrix multiplications.

The transformer architecture made this possible. Before transformers
language models could only capture local patterns within a few words.
Recurrent networks forgot information that appeared more than a few
dozen words back. Attention changed that. Attention lets every word
interact with every other word regardless of distance. A word at the
end of a paragraph can attend to a word at the beginning as easily as
to the word right next to it.

The scale made this powerful. GPT-2 with 1.5 billion parameters could
write plausible paragraphs. GPT-3 with 175 billion parameters could
write plausible essays and answer questions and generate code. The jump
in capability came almost entirely from more data and more parameters. The
architecture stayed almost the same.

The latest generation of models adds instruction following. They are
trained not just to predict the next word but to predict the next word
in a helpful and harmless assistant's response. This additional training
makes the models useful as tools rather than just interesting as
demonstrations.

But underneath the chat interface and the instruction tuning and the
safety filters the core mechanism is unchanged. Tokens in. Attention
across. Feed forward through. Logits out. The same story we have traced
from beginning to end. The same mathematics. The same architecture. The
same gradient descent optimizing cross entropy loss one step at a time.

---

## Epilogue: What You Can Build Next

You have now seen every piece of a modern language model. You could
build one from scratch with the code in this guide. You could modify it.
Add more layers. Use a bigger dataset. Experiment with different
attention patterns. Swap SwiGLU for a different activation.

The architecture described here is not the final word. Research
continues. State space models like Mamba challenge the transformer's
dominance. Mixture of experts routes tokens through different sub
networks to scale more efficiently. Retrieval augmented generation
connects models to external knowledge bases. But the core ideas are
stable. Embeddings. Attention. Residuals. Normalization. Gradient
descent. These will be relevant for as long as neural networks exist.

You now understand them. Not just what they are. Why they are. Every
design choice in this architecture was made to solve a specific problem.
The residual connections solve vanishing gradients. RMSNorm solves
activation drift. SwiGLU solves the inflexibility of simple activation
functions. RoPE solves position encoding without parameters. Every
piece tells a story.

The story of modern AI is the story of many people over many years
solving one problem at a time and stacking their solutions into
something greater than the sum of its parts. You now know every part.
You can be one of those people.