# FAQ and Troubleshooting

## Training

### Q: My loss is stuck at 10.8 after thousands of steps. What is wrong

10.8 is the loss of a random model predicting uniformly over the
vocabulary. It equals ln(50257). If your loss stays at 10.8 your model
is not learning. Possible causes.

The learning rate is too low to move the weights. Try increasing from
3e-4 to 1e-3 temporarily and see if the loss moves.

The optimizer is not stepping. Check that `optimizer.step()` is being
called and that `optimizer.zero_grad()` is called after.

The gradients are zero. A bug in the loss computation or the backward
pass. Check that `loss.backward()` is called and that gradients are
flowing. Print `model.layers[0].attention.qkv_proj.weight.grad` after
backward. It should be non zero.

The data is wrong. Maybe input and target are identical or the targets
are all the same token. Print a few samples.

### Q: My loss is NaN. What happened

NaN means not a number. It means a number overflowed or divided by zero.
This is almost always caused by the learning rate being too high or
gradient clipping being missing.

Fix: lower the learning rate by 10x. Add gradient clipping with max
norm 1.0. Check that your loss function is computing correctly. Print
the logits before loss. If they contain NaN the problem is in the model
forward pass. If they are fine the problem is in the loss computation.

### Q: My loss decreases for a while then suddenly spikes

This is gradient explosion. A rare batch of data causes very large
gradients that shoot the model weights into a region where the loss
is huge.

Fix: add gradient clipping. `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`. This should be called after `loss.backward()` and before `optimizer.step()`.

### Q: Training is extremely slow on CPU. What can I do

CPU training is 10x to 50x slower than GPU. Options:

Use the tiny config (d_model=256, 4 layers). It trains in minutes on
CPU. The small config (d_model=768, 12 layers) will take days.

Use gradient accumulation to simulate larger batches without more
memory. But this does not speed up training. It only lets you use a
bigger effective batch size.

Use a cloud GPU. Google Colab provides a free T4 GPU. Run the colab
notebook for one click training.

Use Apple MPS if you have a Mac. Main.py now auto detects MPS and
enables mixed precision.

### Q: Do I need to train the full 50,000 steps

No. You can stop early. 500 steps on the tiny config gives a loss
around 6 to 7. The model will not be good but it proves the code works.
For the small config 5,000 steps shows obvious learning. 50,000 steps
is for production quality. Stop whenever you are satisfied with the
generated text.

### Q: How do I know if my model is overfitting

If the training loss keeps decreasing but the model generates
repetitive or nonsensical text it is overfitting. The model memorized
the training data instead of learning general patterns.

Fix: increase dropout (try 0.2 or 0.3). Increase weight decay (try
0.2). Reduce the number of training steps. Use a larger and more
diverse dataset.

## Generation

### Q: The model generates gibberish

This is normal for a randomly initialized model or a model trained for
very few steps. Even 500 steps on the tiny config produces mostly
gibberish. The model needs thousands of steps to produce coherent text.

If the model was trained for many steps and still produces gibberish
check that the tokenizer is the same one used during training. Using
a different tokenizer for generation than training produces garbage
because the token IDs mean different things.

### Q: The model repeats the same phrase over and over

This is a common problem called repetitive degeneration. The model
learns that repeating itself is a safe prediction because repeated
patterns are common in text.

Fix: increase the temperature to 0.8 or 1.0. Use top_k sampling with
k=50. Use top_p sampling with p=0.9. These prevent the model from
always picking the most likely token which is often a repetition.

### Q: How do I make the model more creative

Increase temperature to 1.2 or 1.5. Remove or increase top_k to 100.
Set top_p to 0.95. The model will pick less likely tokens more often
producing more varied output.

### Q: How do I make the model more factual

Decrease temperature to 0.3 or 0.5. Set top_k to 20. Use top_p of 0.5
to 0.7. The model will stick to its most confident predictions. More
accurate but less interesting.

### Q: What is the `<|endoftext|>` token I see in my output

This is the end of text marker. The model was trained with this token
between documents. During generation the model sometimes predicts this
token meaning it thinks the text should end. You can filter it out or
stop generation when it appears.

## Architecture

### Q: Why RoPE instead of learned positional embeddings

Learned positional embeddings cannot handle sequences longer than the
training length. If trained on 1024 tokens the model cannot process
2048 tokens. RoPE captures relative position so it generalizes to any
length. RoPE also has no learned parameters. Free improvement.

### Q: Why RMSNorm instead of LayerNorm

RMSNorm is mathematically simpler and about 15 percent faster. It
removes the mean centering and bias which experiments showed are
unnecessary. Every modern model uses RMSNorm.

### Q: Why SwiGLU instead of ReLU or GELU

SwiGLU has a gating mechanism. It learns which information to pass and
which to block. ReLU and GELU treat every input the same way. The gate
gives SwiGLU more expressive power per parameter. At large scale this
translates to better performance.

### Q: Why weight tying

The embedding layer and output layer do inverse operations. Embeddings
map token IDs to vectors. The output layer maps vectors to token
probabilities. Sharing the matrix saves 30 percent of parameters and
improves training because each token embedding gets gradient signals
from both directions.

### Q: Does our model use Flash Attention

No. Flash Attention is an optimized CUDA kernel that speeds up
attention by 2x to 4x. It does not change the math. Our implementation
uses standard PyTorch operations which are slower but understandable.
For production use you would swap in Flash Attention.

### Q: Does our model use grouped query attention

No. Grouped query attention reduces the number of key and value heads
relative to query heads. This saves memory in the KV cache during
inference. Our model uses standard multi head attention where Q K and
V all have the same number of heads.

## Hardware

### Q: Can I train on my laptop

Yes. The tiny config (256 dims, 4 layers, 17M params) trains in 2 to 5
minutes on a modern laptop CPU. The small config (768 dims, 12 layers,
152M params) takes hours to days on CPU. A GPU makes a huge difference.

### Q: What GPU do I need

Tiny config: any GPU or CPU.
GPT-2 Small (152M): 4GB VRAM minimum. 8GB comfortable.
GPT-2 Medium (350M): 8GB VRAM minimum. 12GB comfortable.
GPT-3 1.3B: 12GB VRAM minimum. 16GB comfortable.
GPT-3 6.7B: 24GB VRAM minimum.
GPT-3 175B: 8x A100 80GB.

### Q: How much memory does my model use

Rule of thumb: each parameter uses 2 bytes (bfloat16) for weights plus
8 bytes (float32) for optimizer states during training. Total is about
10 bytes per parameter for training.

```
17M params:  170 MB training memory
152M params: 1.5 GB training memory
7B params:   70 GB training memory (needs multiple GPUs)
```

## Bugs

### Q: PyTorch 2.6+ fails to load my checkpoint

PyTorch 2.6 changed `torch.load` defaults to `weights_only=True`.
Pass `weights_only=False` to load checkpoints containing custom classes
like GPTConfig. Our code handles this. If you use an older notebook
that does not have the fix add `weights_only=False` to your load call.

### Q: I get a shape mismatch error in attention

The most common cause is the sequence length and number of heads being
swapped. Our attention expects input as [batch, seq_len, d_model]
which is reshaped internally to [batch, num_heads, seq_len, head_dim].
If your input has num_heads where seq_len should be the broadcast fails.

### Q: My loss prints as 0.0000

The loss is probably computed as 0 divided by something. Check that
logits and targets have the correct shapes and that cross_entropy is
called correctly. Cross entropy expects logits of shape [N, vocab_size]
and targets of shape [N] with integer class indices.

### Q: I get `RuntimeError: expected scalar type Float but found Half`

This happens when mixed precision is enabled but some operation expects
float32. Use `torch.amp.autocast(device_type, enabled=True)` for the
forward pass and keep the loss computation in float32. The autocast
context manager handles dtype conversion for most operations.

### Q: The attention weights are all NaN after a few training steps

This is usually caused by the attention scores becoming too large
before softmax. Check that you are dividing by `sqrt(head_dim)` in the
attention score computation. Without this division the scores can be
large enough that `exp(score)` overflows to infinity.

### Q: I ran import tiktoken and got an error

Install tiktoken: `pip install tiktoken`. This is the tokenizer used
by GPT-2 and GPT-3. It is written in Rust and is very fast. If you get
a compilation error on installation try `pip install tiktoken --no-binary tiktoken` or upgrade your pip.