# AdamW: The Optimizer That Trains Language Models

## What is it

AdamW is the algorithm that updates the model's weights during
training. After the training loop computes how wrong a prediction
was and which direction to move each weight, AdamW decides
exactly how far to move. It bases that decision on the history of
past gradients for each parameter.

Think of it like hiking down a mountain in the fog. You cannot
see the bottom. You can only feel which direction is downhill.
You take a step. Then you feel again. A naive hiker always takes
the same size step. But on steep, rugged ground a big step
overshoots, and on a long gentle slope small steps waste time.
AdamW remembers how steep each parameter has been and adjusts the
step size accordingly: cautious where it has been steep, bolder
where it has been gentle. It also remembers the general direction
to keep momentum going.

The W in AdamW stands for weight decay, specifically decoupled
weight decay. This is the key innovation over the original Adam
optimizer. Weight decay slowly pushes all weights toward zero to
prevent them from growing too large. In AdamW this push is
separated from the gradient calculation, so its strength no
longer depends on each parameter's adaptive step size.

## Where is it used

AdamW is called every training step after backward propagation.
It takes the gradients that have been computed and clipped and
applies them to the model weights.

```python
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()  # AdamW updates weights here
optimizer.zero_grad()
```
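
To see where those four lines sit in a full loop, here is a minimal sketch on a toy regression problem. The model, data, learning rate, and step count are all made up for illustration; only the order of the calls mirrors the snippet above.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy data (an assumption for this sketch): learn y = 2x + 1.
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
y = 2.0 * x + 1.0

model = nn.Linear(1, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.0)
loss_fn = nn.MSELoss()

initial_loss = loss_fn(model(x), y).item()
for step in range(500):
    loss = loss_fn(model(x), y)
    loss.backward()                                                    # compute gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip them
    optimizer.step()                                                   # AdamW updates weights here
    optimizer.zero_grad()                                              # clear gradients for the next step

final_loss = loss_fn(model(x), y).item()
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Weight decay is set to zero here only so the toy model can fit the line exactly; a real run would normally keep it on.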

## Why we use it instead of plain gradient descent

Plain gradient descent is simple. Move each weight in the
direction that reduces the loss. The step size is the same for
every weight.

```
weight = weight - learning_rate × gradient
```

This has three problems.

First, the step size is fixed. A weight that needs a big change
gets the same learning rate as a weight that needs a tiny change.
The learning rate must be chosen for the most sensitive weights.
This makes training slow for all other weights.
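
A small sketch can make this concrete. It assumes a made-up loss with two weights, one a hundred times more sensitive than the other, and runs plain gradient descent with two different learning rates. The function name and the numbers are illustrative only.

```python
def gradient_descent(lr, steps=100):
    """Plain gradient descent on loss = 0.5 * (100 * a**2 + b**2)."""
    a, b = 1.0, 1.0
    for _ in range(steps):
        grad_a = 100.0 * a   # sensitive weight: steep direction
        grad_b = 1.0 * b     # insensitive weight: gentle direction
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# A learning rate small enough for the sensitive weight...
a_safe, b_safe = gradient_descent(lr=0.015)
# ...and one twice as large, which would suit the gentle weight better.
a_big, b_big = gradient_descent(lr=0.03)

print(f"lr=0.015: a={a_safe:.2e}, b={b_safe:.3f}")
print(f"lr=0.03:  a={a_big:.2e}, b={b_big:.3f}")
```

With the safe learning rate the sensitive weight converges quickly while the other one is still far from zero after 100 steps. Doubling the learning rate helps the slow weight but makes the sensitive one diverge.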

Second, there is no momentum. If the gradient is noisy and points
in a different direction each step, the optimizer zigzags back and
forth and makes slow progress. Momentum smooths out the noise by
incorporating the direction from previous steps.
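
The smoothing effect can be sketched with synthetic numbers. The example below assumes a true gradient of 1.0 buried in Gaussian noise and counts how often the raw and the smoothed signal point the wrong way. The noise level and seed are arbitrary choices.

```python
import random

rng = random.Random(0)     # fixed seed so the run is repeatable
true_gradient = 1.0        # the direction we would like to follow
beta = 0.9                 # smoothing factor

momentum = 0.0
raw_wrong = 0
smoothed_wrong = 0
for _ in range(1000):
    noisy = true_gradient + rng.gauss(0.0, 2.0)      # one noisy "batch" gradient
    momentum = beta * momentum + (1 - beta) * noisy  # running average
    raw_wrong += noisy < 0                           # raw gradient points the wrong way
    smoothed_wrong += momentum < 0                   # smoothed gradient points the wrong way

print(f"raw gradient wrong direction:      {raw_wrong} of 1000 steps")
print(f"smoothed gradient wrong direction: {smoothed_wrong} of 1000 steps")
```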

Third, there is no weight decay. Without regularization, weights
can grow arbitrarily large. Large weights mean the model is
overconfident about some patterns and ignores others. The model
overfits.

AdamW solves all three problems.

## When was it invented

Adam was introduced in 2014 by Diederik Kingma and Jimmy Ba. It
quickly became the default optimizer for deep learning. But
researchers noticed that weight decay, as usually implemented
with Adam (an L2 penalty added to the gradient), was entangled
with the adaptive learning rates. The penalty was rescaled
parameter by parameter, so weights with large gradients were
regularized less than weights with small gradients.

AdamW was proposed in 2017 by Ilya Loshchilov and Frank Hutter.
They showed that decoupling weight decay from the adaptive
learning rates fixed the problem. In their experiments AdamW
generalized better than Adam with L2 regularization. The fix
was simple but the impact was large. GPT-3 was trained with AdamW.
LLaMA was trained with AdamW. Most modern language models use
AdamW or a close variant.

## How it works

AdamW maintains two running averages for each parameter. The first
is the momentum, which tracks the average direction of recent
gradients. The second is the velocity, which tracks the average
squared magnitude of recent gradients.

### Step 1: compute the noisy gradient

```python
gradient = compute_gradient(loss, weight)
```

This is the raw signal from one batch of data. It is noisy. A
single batch might give a misleading direction.

### Step 2: update the momentum

```python
momentum = beta1 * momentum + (1 - beta1) * gradient
```

The momentum is an exponentially weighted average of past
gradients. Beta1 is usually 0.9. This means each step keeps
ninety percent of the previous momentum and blends in ten percent
of the new gradient, so older gradients fade away geometrically.
The momentum smooths out noise and gives a stable direction.
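
To check how much each past gradient actually counts, the recursion can be unrolled. The sketch below uses a handful of made-up gradient values and compares the recursive update with the explicit weighted sum, where the gradient from `k` steps ago carries weight `(1 - beta1) * beta1**k`.

```python
import math

beta1 = 0.9
gradients = [0.5, -1.0, 2.0, 0.3, 1.5]   # made-up values, oldest first

# Recursive form, as in Step 2.
momentum = 0.0
for g in gradients:
    momentum = beta1 * momentum + (1 - beta1) * g

# Explicit form: newest gradient has k = 0, the one before it k = 1, and so on.
weights = [(1 - beta1) * beta1 ** k for k in range(len(gradients))]
explicit = sum(w * g for w, g in zip(weights, reversed(gradients)))

newest_share = weights[0]                                         # weight on the newest gradient
last_ten_share = sum((1 - beta1) * beta1 ** k for k in range(10)) # weight on the ten newest

print(math.isclose(momentum, explicit))
print(f"newest gradient: {newest_share:.0%}, ten newest together: {last_ten_share:.0%}")
```

The ten most recent gradients together account for roughly two thirds of the average, which is why a beta1 of 0.9 is often described as a memory of about ten steps.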

### Step 3: update the velocity

```python
velocity = beta2 * velocity + (1 - beta2) * gradient ** 2
```

The velocity tracks how large each parameter's gradients have
been, regardless of sign. Beta2 defaults to 0.999 in most
libraries, but 0.95 is common for language models. Parameters
that have been receiving large gradients get a high velocity.
Parameters whose gradients have been small get a low velocity.
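
As a rough illustration, the sketch below feeds two made-up gradient streams through the velocity update. Both flip sign every step; only their size differs. The square root of the velocity ends up close to the typical gradient size of each stream.

```python
import math

beta2 = 0.95

def run_velocity(gradients):
    """Apply the Step 3 update to a sequence of gradients."""
    velocity = 0.0
    for g in gradients:
        velocity = beta2 * velocity + (1 - beta2) * g ** 2
    return velocity

# Gradients that alternate in sign, so their plain average is near zero.
large = [10.0 * (-1) ** i for i in range(200)]
small = [0.01 * (-1) ** i for i in range(200)]

rms_large = math.sqrt(run_velocity(large))
rms_small = math.sqrt(run_velocity(small))

print(f"sqrt(velocity) for large gradients: {rms_large:.4f}")
print(f"sqrt(velocity) for small gradients: {rms_small:.6f}")
```

Because the gradient is squared, the sign flips do not cancel out the way they would in the momentum.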

### Step 4: bias correction

Both momentum and velocity start at zero. In the first few steps
they are biased toward zero. The bias correction fixes this.

```python
momentum_corrected = momentum / (1 - beta1 ** t)
velocity_corrected = velocity / (1 - beta2 ** t)
```

Here t is the current step number, starting at 1. After many
steps the correction becomes negligible. But in the first few
steps it keeps the zero initialization from distorting the size
of the update.
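
A worked example for the very first step shows what the correction does. The gradient value is made up; the betas are the ones used elsewhere in this document.

```python
import math

beta1, beta2 = 0.9, 0.95
gradient = 4.0
t = 1                                       # first step, both averages start at zero

momentum = (1 - beta1) * gradient           # only 10% of the gradient
velocity = (1 - beta2) * gradient ** 2      # only 5% of the squared gradient

momentum_corrected = momentum / (1 - beta1 ** t)
velocity_corrected = velocity / (1 - beta2 ** t)

# The update in Step 6 is proportional to momentum / sqrt(velocity).
ratio_uncorrected = momentum / math.sqrt(velocity)
ratio_corrected = momentum_corrected / math.sqrt(velocity_corrected)

print(f"corrected momentum: {momentum_corrected:.2f}, corrected velocity: {velocity_corrected:.2f}")
print(f"update ratio without correction: {ratio_uncorrected:.3f}")
print(f"update ratio with correction:    {ratio_corrected:.3f}")
```

After correction the first step recovers the raw gradient and its square, so the update ratio is exactly one step of size `learning_rate`. With these betas the uncorrected ratio would be less than half of that.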

### Step 5: decoupled weight decay

```python
weight = weight * (1 - learning_rate * weight_decay)
```

This shrinks every weight by a tiny fraction. Weight decay is
usually 0.1. With a learning rate of 0.0003 each weight is
multiplied by 0.99997 per step. Over thousands of steps this
gently pushes weights toward zero. Only weights that constantly
receive strong gradients survive. Weights that are not useful
fade away.
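
The arithmetic is easy to check. The sketch below assumes a weight that receives no gradient at all and a constant learning rate, which real training schedules usually do not have, and computes how much of it is left after many steps.

```python
import math

learning_rate = 3e-4
weight_decay = 0.1
per_step = 1 - learning_rate * weight_decay   # multiplier applied every step

def remaining_fraction(steps):
    """Fraction of a weight left after `steps` decay-only updates."""
    return per_step ** steps

after_10k = remaining_fraction(10_000)
after_100k = remaining_fraction(100_000)

print(f"per-step multiplier: {per_step:.5f}")
print(f"left after 10,000 steps:  {after_10k:.1%}")
print(f"left after 100,000 steps: {after_100k:.1%}")
```

Under these assumptions an unused weight keeps roughly three quarters of its value after ten thousand steps and only about five percent after a hundred thousand.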

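Note that this step happens before the gradient update and is
completely independent of the gradient. This is the decoupled
part of AdamW. In the original Adam, weight decay was usually
added to the gradient as an L2 penalty, so it was divided by the
square root of the velocity along with everything else. That made
its strength depend on each parameter's gradient history.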
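
The difference can be sketched with a deliberately simplified calculation. It looks only at the first step (t = 1), where the bias-corrected momentum equals the total gradient and the square root of the corrected velocity equals its absolute value. The two helper functions are hypothetical names for this illustration, not library calls.

```python
def coupled_decay_shrink(weight, grad, lr=3e-4, wd=0.1, eps=1e-8):
    """Shrink caused by the decay term when it is folded into the gradient (Adam + L2)."""
    total = grad + wd * weight                 # L2 penalty added to the gradient
    # At t = 1 the adaptive scaling divides everything by |total|.
    return lr * (wd * weight) / (abs(total) + eps)

def decoupled_decay_shrink(weight, lr=3e-4, wd=0.1):
    """Shrink caused by the decay term in AdamW (Step 5)."""
    return lr * wd * weight

weight = 1.0
coupled_big_grad = coupled_decay_shrink(weight, grad=10.0)
coupled_small_grad = coupled_decay_shrink(weight, grad=0.01)
decoupled = decoupled_decay_shrink(weight)

print(f"Adam + L2, large gradient: {coupled_big_grad:.2e}")
print(f"Adam + L2, small gradient: {coupled_small_grad:.2e}")
print(f"AdamW, any gradient:       {decoupled:.2e}")
```

In the coupled version the same weight is decayed by very different amounts depending on its gradient. In the decoupled version the shrink depends only on the weight itself.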

### Step 6: apply the gradient

```python
weight = weight - learning_rate * momentum_corrected / (sqrt(velocity_corrected) + eps)
```

The momentum supplies the direction, and the learning rate scales
the step. The result is divided by the square root of the
velocity. Parameters with high velocity have been receiving large
gradients, so their steps are scaled down. Parameters with low
velocity have been receiving small gradients, so their steps are
scaled up. The epsilon prevents division by zero.
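
Putting the six steps together for a single scalar weight gives the sketch below. It is a plain-Python illustration of the walkthrough, not the library implementation; the function name `adamw_step` and the `state` dictionary are made up here.

```python
import math

def adamw_step(weight, grad, state, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a single scalar weight. Mutates `state`."""
    state["t"] += 1
    t = state["t"]
    # Steps 2 and 3: update the running averages.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Step 4: bias correction.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    # Step 5: decoupled weight decay, independent of the gradient.
    weight = weight * (1 - lr * weight_decay)
    # Step 6: adaptive gradient step.
    return weight - lr * m_hat / (math.sqrt(v_hat) + eps)

# One step from weight 1.0 with a made-up gradient of 0.5.
state = {"t": 0, "m": 0.0, "v": 0.0}
first = adamw_step(1.0, grad=0.5, state=state)

# Minimize (w - 3)^2, whose gradient is 2 * (w - 3). Decay is off so w can reach 3.
w = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(2000):
    w = adamw_step(w, grad=2 * (w - 3.0), state=state, lr=0.01, weight_decay=0.0)

print(f"after one step: {first:.5f}")
print(f"after minimizing: {w:.3f}")
```

On the first step the weight moves by almost exactly one learning rate, plus the small decay shrink, no matter how large the gradient is. That is the adaptive scaling at work.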

## Two parameter groups

Not all parameters should get weight decay. The biases and
normalization weights are one-dimensional. They adjust the offset
and scale of activations. Pushing them toward zero would prevent
them from doing their job. We create two groups of parameters
with different weight decay values.

```python
import torch


def create_optimizer(model, config):
    # Note: config is accepted but not read here; the values below are hard-coded.
    decay_params = []      # Linear and embedding weights
    no_decay_params = []   # Biases and normalization weights

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.dim() <= 1 or 'norm' in name.lower() or 'bias' in name:
            no_decay_params.append(param)
        else:
            decay_params.append(param)

    return torch.optim.AdamW([
        {'params': decay_params, 'weight_decay': 0.1},
        {'params': no_decay_params, 'weight_decay': 0.0},
    ], lr=3e-4, betas=(0.9, 0.95), eps=1e-8)
```

The decay group gets weight decay of 0.1. The no decay group gets
zero weight decay. Each group is treated separately by the
optimizer.
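
To see which parameters land in which group, the same rule can be applied to a tiny made-up model. The three-layer `nn.Sequential` below is purely illustrative. Its parameter names are just `0.weight`, `1.bias`, and so on, so the LayerNorm weight is caught by the dimension check rather than by its name.

```python
import torch
from torch import nn

# Hypothetical toy model: linear, layer norm, linear.
model = nn.Sequential(nn.Linear(8, 16), nn.LayerNorm(16), nn.Linear(16, 4))

decay_names, no_decay_names = [], []
decay_params, no_decay_params = [], []
for name, param in model.named_parameters():
    if param.dim() <= 1 or 'norm' in name.lower() or 'bias' in name:
        no_decay_names.append(name)
        no_decay_params.append(param)
    else:
        decay_names.append(name)
        decay_params.append(param)

optimizer = torch.optim.AdamW([
    {'params': decay_params, 'weight_decay': 0.1},
    {'params': no_decay_params, 'weight_decay': 0.0},
], lr=3e-4, betas=(0.9, 0.95), eps=1e-8)

print("decay:   ", decay_names)
print("no decay:", no_decay_names)
for group in optimizer.param_groups:
    print(len(group['params']), "tensors with weight_decay =", group['weight_decay'])
```

Only the two-dimensional weight matrices of the linear layers end up in the decay group.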

## The hyperparameters

Every optimizer has settings called hyperparameters. For AdamW
the important ones are:

```
learning_rate = 0.0003  (3e-4)
  How big a step to take. Smaller is safer but slower.

betas = (0.9, 0.95)
  How much to trust past gradients. Higher means smoother updates.

weight_decay = 0.1
  How aggressively to push weights toward zero. Higher prevents
  overfitting but too high makes the model underfit.

eps = 0.00000001 (1e-8)
  A tiny number to prevent division by zero. Rarely needs tuning.
```

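These values follow the settings reported for the LLaMA models
and have been battle tested on models with billions to tens of
billions of parameters. The learning rate is the one most often
adjusted: the larger LLaMA models used a lower value. Unless you
are doing something unusual there is rarely a reason to change
the others.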

## What you need to remember

AdamW is the standard optimizer for training language models. It
combines momentum for stability, adaptive learning rates for
efficiency, and decoupled weight decay for regularization. The
three mechanisms work together to make training fast, stable, and
resistant to overfitting.

Most production language models train with AdamW. The
hyperparameters are well established and rarely need tuning. Its
main cost is memory, since it keeps two running averages for
every parameter. For most training runs that is a price worth
paying. It is simply the right tool for the job.