# How to Read a Loss Curve

## The short answer

A loss curve is the single most important diagnostic for training a
language model. It tells you if the model is learning and if it is
learning well and if something is broken. A good curve goes down
smoothly. That is the happy path. Everything else is a problem.
This guide shows you every common pattern and what to do about it.

## The setup

You are training a model. Every 100 steps you record the loss. You
plot the steps on the x axis and the loss on the y axis. You want
the line to go down. But the shape of the line matters more than the
direction. Two training runs can both have decreasing loss but one
is doomed and the other is thriving.

All the examples below use starting loss around 10.8 which is the
loss of a random model predicting uniformly over GPT-2's 50257
token vocabulary. Loss equals negative log of one divided by 50257
which is approximately ln(50257) which is approximately 10.82.

## Pattern 1: The good curve

```
Loss
 10.8 |*
      | *
  9.0 |  *
      |   *
  7.0 |    *
      |     **
  5.0 |       ***
      |          ******
  3.0 |                ************
      +-----+-----+-----+-----+-----+-----+
      0    5k   10k   15k   20k   25k   30k
                    Steps
```

The loss starts around 10.8. It drops quickly in the first few
thousand steps. The drop slows as training continues. The curve is
smooth. No spikes. No plateaus. This is what successful training
looks like.

The early rapid drop is the model learning basic patterns. Capital
letters start sentences. Periods end them. Common words appear in
predictable positions. These patterns are easy to learn because
they are consistent. The loss drops fast.

The later slow descent is the model learning subtle patterns.
Subject verb agreement. Pronoun resolution. The difference between
affect and effect. These patterns are harder because they depend
on long range context and nuanced meaning.

The curve never plateaus completely because the model always has
something more to learn. There is always a slightly better set of
weights that predicts the next word a tiny bit more accurately.

## Pattern 2: The flat line

```
Loss
 10.8 |********************************************
      |
  8.0 |
      |
  5.0 |
      |
  2.0 |
      +-----+-----+-----+-----+-----+-----+
      0    5k   10k   15k   20k   25k   30k
```

The loss never moves. It stays at 10.8 forever. The model is not
learning at all. This is almost always a bug.

Possible causes in order of likelihood.

The learning rate is zero or the optimizer is not stepping. Check
that `optimizer.step()` is called. Check that `optimizer.zero_grad()`
is called AFTER step not before.

The gradients are zero. A bug in the loss computation. Check that
the logits and targets are correctly aligned. The logits should be
for predicting the next token. The targets should be the actual
next tokens.

The model weights are not updating. Check `weight.grad` after
`loss.backward()`. It should be non-zero for at least some weights.

The data is broken. Maybe input and target are identical. Maybe all
targets are the same token. Print a few batches and inspect them.

## Pattern 3: Loss spikes

```
Loss
 10.8 |*
      | *
  9.0 |  *
      |   *      /
  7.0 |    *    /
      |     ** /
  5.0 |    /  **
      |   /      ***
  3.0 |  /           ******
      +-----+-----+-----+-----+-----+-----+
      0    5k   10k   15k   20k   25k   30k
                    ↑
                   spike
```

The loss was decreasing nicely. Then it jumped up by a factor of
two or three. The model is still learning but something bad happened.

Cause: gradient explosion. A rare batch of training data contained
an unusual pattern. The gradients became very large. The weights took
a large step in a bad direction.

Fix: add gradient clipping. `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`. Call this after `loss.backward()` and before `optimizer.step()`. Our training code already does this.

If clipping is already in place the max norm might be too high. Try
reducing from 1.0 to 0.5. If the spikes persist lower the learning
rate.

A single spike is not fatal. The model usually recovers. But repeated
spikes are a sign that something is systematically wrong. The data
might contain corrupted examples. The learning rate might be too
high. The architecture might have a bug in how it handles certain
sequence lengths.

## Pattern 4: The plateau

```
Loss
 10.8 |*
      | *
  9.0 |  *
      |   *
  7.0 |    *
      |     *
  6.5 |      ******************************
      |
  3.0 |
      +-----+-----+-----+-----+-----+-----+
      0    5k   10k   15k   20k   25k   30k
               ↑
          plateau starts
```

The loss drops to around 6.5 and then stops. No matter how many more
steps you train the loss refuses to go lower. The model has hit a
wall.

This is called a learning rate plateau. The learning rate is too low
for the model to escape its current region of weight space. The
gradients are too small to move the model to a better configuration.

Fix: increase the learning rate. If you are using the cosine schedule
try increasing the peak learning rate from 3e-4 to 1e-3. If you are
already at the minimum of the schedule restart with a higher peak.

Alternative cause: the model capacity is exhausted. A model with 17
million parameters trained on 100 megabytes of text can only capture
so much. The loss cannot go below a certain floor determined by the
model size and data complexity. To push lower you need a bigger model
or more data or both.

## Pattern 5: Training loss goes down but validation loss goes up

```
Loss
 10.8 |*
  9.0 | *    Training (solid)
  7.0 |  *         *
      |   *       * *
  5.0 |    *     *   *
      |     *   *     *
  3.0 |      ***       ************* Validation (dashed)
      |        *                     **************
  1.0 |         *****                               ******
      +-----+-----+-----+-----+-----+-----+-----+-----+
      0    5k   10k   15k   20k   25k   30k   35k   40k
                   ↑
              overfitting starts
```

The training loss keeps decreasing. The validation loss was decreasing
too but then it started going up. The model is overfitting. It is
memorizing the training data instead of learning general patterns.

Fix: stop training at the point where validation loss starts rising.
This is called early stopping. You do not need to fix the model. You
just need to stop before it overfits.

Prevention: increase dropout from 0.1 to 0.2 or 0.3. Increase weight
decay from 0.1 to 0.2. Use a larger and more diverse training dataset.
The model cannot memorize data it has not seen enough times.

## Pattern 6: Loss goes negative or NaN

```
Loss
 10.8 |*
      | *
  9.0 |  *
      |   *
  7.0 |    *    |
      |     *   |
  5.0 |      *  |
      |       * |
  3.0 |        *|
      |         *
  0.0 +----------*--------*-------*------
  --------------------------------
  NaN  NaN  NaN  NaN  NaN  NaN  NaN
```

The loss was decreasing. Then it went to zero. Then it became NaN.
Training is dead. The model is producing infinities or NaN values.

Cause: numerical overflow. The logits or the loss computation produced
numbers too large for bfloat16 or float32 to represent. This happens
when the learning rate is too high and the weights blow up.

Fix: lower the learning rate dramatically. Add gradient clipping.
Check that you are dividing by `sqrt(head_dim)` in the attention
scores. Without this division the scores can become large enough that
`exp(score)` overflows bfloat16. Check that your loss computation is
correct. Cross entropy with logits of 1000 and targets produces NaN.

If using mixed precision check that the autocast context is set up
correctly. Some operations like softmax and layer norm should stay
in float32 for numerical stability. Autocast handles this but only
if the backend supports it.

## Pattern 7: The step function

```
Loss
 10.8 |*
      | *
  9.0 |******
      |      ******
  7.0 |           ******
      |                ******
  5.0 |                     ******
      |                          ******
  3.0 |                               ******
      +-----+-----+-----+-----+-----+-----+
      0    5k   10k   15k   20k   25k   30k
```

The loss drops suddenly at regular intervals. The drops coincide with
learning rate changes. The learning rate was reduced by a factor of
10 and the loss jumped down.

Cause: step decay learning rate schedule. Some training setups use
this intentionally. But the sudden drops can disturb the model's
momentum. The model took a while to adapt to the old learning rate
and now the rate changed abruptly.

Preference: use cosine decay instead of step decay. Cosine decay is
smooth. The model adapts continuously. No sudden jumps. Our training
code uses cosine warmup with cosine decay. No step changes.

## Pattern 8: The sawtooth

```
Loss
 10.8 |*  *  *  *
      | *  *  *  *
  9.0 |*  *  *  *
      | *  *  *  *
  7.0 |*  *  *  *  *  *
      | *  *  *  *  *  *
  5.0 |*  *  *  *  *  *  *  *
      | *  *  *  *  *  *  *  *
  3.0 |*  *  *  *  *  *  *  *  *  *
      +-----+-----+-----+-----+-----+
      0    5k   10k   15k   20k   25k
```

The loss bounces up and down every few steps. The overall trend is
downward but the noise is high.

Cause: batch size is too small. Each batch is too small to be
representative of the overall data distribution. Some batches are
easy and produce low loss. Others are hard and produce high loss.
The model overreacts to each batch.

Fix: increase the batch size or use gradient accumulation to simulate
a larger batch. Our training code uses gradient accumulation with
`grad_accum_steps=2`. Try increasing to 4 or 8. The effective batch
size increases without using more GPU memory.

The sawtooth is not necessarily a problem. Even with a good batch
size some noise is normal. The model learns the average gradient
direction over many batches. Individual batches can be noisy as long
as the average is correct.

## What you need to remember

A good loss curve is smooth and decreasing with a steep early drop
and a gradual later descent. No spikes. No plateaus. No NaN.

If your curve does not look like Pattern 1 something is wrong. The
most common fixes are gradient clipping and learning rate adjustment
and batch size increases and dropout increases. Try them in that
order.

Always save your loss values during training. A loss curve is a
debugging tool. Without it you are training blind. With it you can
diagnose most training problems in seconds by matching the pattern
to one of the eight patterns above.