# Cosine Warmup: The Learning Rate Schedule

## What is it

The cosine warmup schedule controls how the learning rate changes
during training. It starts low and rises to a peak. Then it
gradually falls following a cosine curve. By the end of training
the learning rate is very small and the model settles into a fine
minimum.

Think of it like learning to ride a bicycle. At first you go very
slowly. You wobble. You get a feel for the balance. Once you have
some stability you push harder and go faster. As you approach your
destination you slow down again to make a precise stop. You do
not sprint from the start and slam the brakes at the end.

Cosine warmup does the same thing for neural network training. The
model starts slow to find its balance. It accelerates to full
speed once stable. It decelerates at the end to land softly on
the best possible solution.

## Where is it used

The schedule controls the learning rate parameter inside the
optimizer. Every training step the schedule computes a new
learning rate and assigns it to the optimizer.

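```python
# CosineWarmupScheduler is a user-defined class here, not a PyTorch built-in.
max_steps = 100000
scheduler = CosineWarmupScheduler(optimizer, warmup=2000, max_steps=max_steps)

for step in range(max_steps):
    batch = next(data_iter)   # data_iter: any iterator yielding training batches
    loss = model(batch)       # assumes the model returns a scalar loss
    loss.backward()
    optimizer.step()
    scheduler.step()          # update the learning rate after every optimizer step
    optimizer.zero_grad()
```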
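
`CosineWarmupScheduler` is not a PyTorch built-in, so the loop assumes the class is defined somewhere else. Here is a minimal sketch of what such a class might look like. It assumes the optimizer exposes PyTorch-style `param_groups`, a list of dicts with an `"lr"` key. The `max_lr` and `min_lr` arguments and their defaults are assumptions added for illustration.

```python
import math


class CosineWarmupScheduler:
    """Hypothetical scheduler: linear warmup, cosine decay, then a constant floor."""

    def __init__(self, optimizer, warmup, max_steps, max_lr=3e-4, min_lr=1e-5):
        self.optimizer = optimizer
        self.warmup = warmup
        self.max_steps = max_steps
        self.max_lr = max_lr      # assumed peak learning rate
        self.min_lr = min_lr      # assumed floor after decay
        self.step_count = 0
        self._apply(self.get_lr(0))   # start training at the step-0 rate

    def get_lr(self, step):
        if step < self.warmup:
            return self.max_lr * step / self.warmup
        if step < self.max_steps:
            progress = (step - self.warmup) / (self.max_steps - self.warmup)
            cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
            return self.min_lr + (self.max_lr - self.min_lr) * cosine
        return self.min_lr

    def _apply(self, lr):
        # Assumes PyTorch-style param_groups: a list of dicts with an "lr" entry.
        for group in self.optimizer.param_groups:
            group["lr"] = lr

    def step(self):
        """Advance one training step and write the new rate into the optimizer."""
        self.step_count += 1
        self._apply(self.get_lr(self.step_count))
```

With this formula the very first update uses a rate of exactly zero. Some implementations shift the warmup by one step to avoid that.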

## Why we need it

A constant learning rate seems simpler. Why not just pick one
value and train the whole way? There are two reasons.

First, early training is chaotic. The model's weights are random.
The gradients are large and noisy. Large learning rates at the
start can send the model flying off in random directions. The
warmup phase lets the model find its footing before taking large
steps.

Second, late training is about precision. After thousands of steps
the model is close to a good solution. Large steps would overshoot
the minimum and bounce around it forever. The decay phase lets
the model take tiny careful steps to settle into the exact best
position.

A constant learning rate forces a compromise. A value small enough
to be safe at the start is too slow for the middle. A value fast
enough for the middle is too large for the start and the end.
Warmup plus decay gives stability at the start, speed in the
middle, and precision at the end.
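
The precision argument can be seen on a toy problem. The sketch below runs noisy gradient descent on `f(x) = x**2`, once with a constant rate and once with a cosine-decayed rate. It illustrates only the decay half of the argument, not warmup. The function names, the noise level, and the rates are all assumptions chosen for illustration.

```python
import math
import random


def run(schedule, steps=2000, seed=0):
    """Noisy gradient descent on f(x) = x**2; returns mean |x| over the last 100 steps."""
    rng = random.Random(seed)
    x = 5.0
    tail = []
    for step in range(steps):
        grad = 2.0 * x + rng.gauss(0.0, 1.0)   # true gradient plus Gaussian noise
        x -= schedule(step, steps) * grad
        if step >= steps - 100:
            tail.append(abs(x))
    return sum(tail) / len(tail)


def constant(step, steps, lr=0.3):
    return lr


def cosine_decay(step, steps, max_lr=0.3, min_lr=0.003):
    progress = step / steps
    return min_lr + (max_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


constant_error = run(constant)
decayed_error = run(cosine_decay)
print(f"constant lr:  mean |x| over last 100 steps = {constant_error:.4f}")
print(f"cosine decay: mean |x| over last 100 steps = {decayed_error:.4f}")
```

Both runs see the same noise. The constant rate keeps bouncing around the minimum at `x = 0`, while the decayed rate ends up much closer to it.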

## When was it invented

Learning rate warmup was used for the original Transformer in
2017, which paired a linear warmup with an inverse square root
decay. Warmup is commonly credited with keeping the first few
thousand steps of training stable. Cosine decay was introduced
around the same time as an alternative to step decay schedules,
which drop the learning rate abruptly at predetermined intervals.
Step decay works, but the sudden drops can disturb the model.
Cosine decay is smooth and continuous. GPT-3 and LLaMA both used
linear warmup followed by cosine decay. It has become a common
default for language model training.

## How it works

The schedule has three phases. Each phase is a simple
mathematical formula.

### Phase 1: Linear warmup

The learning rate starts at zero and increases linearly to the
maximum value.

```python
if step < warmup_steps:
    lr = max_lr * step / warmup_steps
```

Example with warmup_steps of 2000 and max_lr of 0.0003:

```
Step 0:    lr = 0.0003 × (0 / 2000) = 0.0
Step 500:  lr = 0.0003 × (500 / 2000) = 0.000075
Step 1000: lr = 0.0003 × (1000 / 2000) = 0.00015
Step 2000: lr = 0.0003 × (2000 / 2000) = 0.0003
```

Every step the learning rate grows by the same tiny amount. No
sudden jumps. The model has two thousand steps to get stable
before it reaches full speed.
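
The worked numbers above can be reproduced with a few lines. This is a minimal sketch. The function name `warmup_lr` is made up for illustration, and the clamp on `step` is an added convenience so the function stays at the peak if called past the warmup.

```python
def warmup_lr(step, max_lr=0.0003, warmup_steps=2000):
    """Linear warmup from 0 to max_lr; steps past warmup_steps are clamped to max_lr."""
    return max_lr * min(step, warmup_steps) / warmup_steps


for step in (0, 500, 1000, 2000):
    print(f"step {step:>4}: lr = {warmup_lr(step):.6f}")

# The rate grows by the same amount on every warmup step.
increment = warmup_lr(1) - warmup_lr(0)
print(f"per-step increment = {increment:.2e}")
```

With these values the per-step increment is `max_lr / warmup_steps`, which is 0.00000015.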

### Phase 2: Cosine decay

After warmup the learning rate follows a cosine curve from the
maximum down to a minimum.

```python
import math

if warmup_steps <= step < max_steps:
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine_decay = 0.5 * (1 + math.cos(math.pi * progress))
    lr = min_lr + (max_lr - min_lr) * cosine_decay
```

The progress variable goes from zero to one over the remaining
steps. The cosine term goes from one down to zero along a smooth
S-shaped curve.

```
Step 2000:   progress = 0.0,  cosine = 1.0,  lr = 0.0003
Step 25000:  progress ≈ 0.23, cosine ≈ 0.87, lr ≈ 0.000262
Step 50000:  progress ≈ 0.49, cosine ≈ 0.52, lr ≈ 0.000160
Step 100000: progress = 1.0,  cosine = 0.0,  lr = 0.00001
```

The learning rate falls slowly at first, then faster in the
middle, then slowly again at the end. In this example the minimum
is 0.00001, which is thirty times smaller than the peak. A common
alternative is to set the floor to a fraction of the peak, such as
10%. This tiny rate at the end lets the model refine its weights
with small, careful updates.
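
To check values at any step, the decay formula can be wrapped in a small function. This is a sketch using the example constants from this section. The function name `cosine_lr` and the extra row at step 75000 are additions for illustration.

```python
import math

MAX_LR = 0.0003
MIN_LR = 0.00001
WARMUP_STEPS = 2000
MAX_STEPS = 100000


def cosine_lr(step):
    """Cosine decay from MAX_LR at WARMUP_STEPS to MIN_LR at MAX_STEPS."""
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return progress, cosine, MIN_LR + (MAX_LR - MIN_LR) * cosine


for step in (2000, 25000, 50000, 75000, 100000):
    progress, cosine, lr = cosine_lr(step)
    print(f"step {step:>6}: progress = {progress:.2f}, cosine = {cosine:.2f}, lr = {lr:.6f}")
```

Note that the cosine term is not linear in progress. At about a quarter of the way through the decay it is still near 0.87, not 0.75.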

### Phase 3: Minimum

After max_steps the learning rate stays at the minimum forever.

```python
lr = min_lr
```

The model continues to learn but at a glacial pace. Each step
makes almost no difference. This is intentional. By this point
most of the learning has already happened. The remaining steps
just polish.
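
One way to get this phase without a separate branch is to clamp the progress value at one. The sketch below takes that approach. The function name `get_lr` and its default arguments are assumptions that reuse the example values from this section.

```python
import math


def get_lr(step, max_lr=0.0003, min_lr=0.00001, warmup_steps=2000, max_steps=100000):
    """Warmup, cosine decay, and constant floor in one function."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Clamping progress to 1.0 makes every step past max_steps return min_lr,
    # so Phase 3 needs no separate branch.
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


for step in (99_999, 100_000, 150_000, 1_000_000):
    print(f"step {step:>9,}: lr = {get_lr(step):.8f}")
```

Every step at or beyond `max_steps` returns the same floor value, no matter how long training continues.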

## A tiny code example

```python
import math
import matplotlib.pyplot as plt

max_lr = 0.0003
min_lr = 0.00001
warmup_steps = 2000
max_steps = 100000

lrs = []
# Run 10,000 steps past max_steps so the flat minimum phase shows up in the plot.
for step in range(max_steps + 10000):
    if step < warmup_steps:
        lr = max_lr * step / warmup_steps
    elif step < max_steps:
        progress = (step - warmup_steps) / (max_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        lr = min_lr + (max_lr - min_lr) * cosine
    else:
        lr = min_lr
    lrs.append(lr)

plt.figure(figsize=(10, 4))
plt.plot(lrs)
plt.xlabel('Training Step')
plt.ylabel('Learning Rate')
plt.title('Cosine Warmup Schedule')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lrs_curve.png', dpi=100)
print("The curve rises from 0 over 2000 warmup steps")
print("then decays along a cosine curve for 98000 steps")
print("and stays at the minimum from step 100000 onward")
```

## The shape of the curve

```
Learning rate
       ^
       |
0.0003 +..............
       |:             ......
       |:                   ......
       |:                         .....
       |:                              .......
       |:                                     ............
   0.0 +----+----+----+----+----+----+----+----+----+----+-->
       0   10k  20k  30k  40k  50k  60k  70k  80k  90k  100k
                         Training steps
```

The curve rises steeply during warmup. At this scale the 2,000
warmup steps are a near-vertical line at the far left, drawn as
the column of colons. The curve stays near the peak for a while.
Then it starts a gentle descent that accelerates in the middle
and flattens at the end. The minimum is reached exactly at the
final training step. Not before. Not after.
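
The steepness of each part of the curve can be measured directly. The sketch below prints how much the rate changes in a single step at four points. The helper names and the sample steps are chosen for illustration, and the constants are the example values from this section.

```python
import math

MAX_LR, MIN_LR = 0.0003, 0.00001
WARMUP_STEPS, MAX_STEPS = 2000, 100000


def get_lr(step):
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    if step < MAX_STEPS:
        progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
        return MIN_LR + (MAX_LR - MIN_LR) * 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR


def change_per_step(step):
    """Finite-difference slope: how much the rate moves between step and step + 1."""
    return get_lr(step + 1) - get_lr(step)


for label, step in [("warmup", 1000), ("early decay", 5000),
                    ("mid decay", 51000), ("late decay", 97000)]:
    print(f"{label:<12} step {step:>6}: change per step = {change_per_step(step):+.2e}")
```

With these constants the warmup slope is roughly thirty times larger than the steepest part of the decay. The decay itself is about ten times faster in the middle than near either end.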

## What you need to remember

Cosine warmup scheduling controls the learning rate across the
entire training run. The rate starts at zero and warms up to a
peak. Then it decays along a cosine curve to a minimum. The
schedule is smooth and continuous with no sudden drops.

Warmup prevents instability at the start of training when
gradients are chaotic. Decay allows precision at the end when the
model is close to the solution. Together they make training both
stable and precise. Many modern language models, including GPT-3
and LLaMA, use this schedule.