# Cheatsheet: Everything You Need in One Place ## Model sizes | Model | Layers | d_model | Heads | Head Dim | Params | |---|---|---|---|---|---| | Tiny (our default) | 4 | 256 | 4 | 64 | 17.1M | | GPT-2 Small | 12 | 768 | 12 | 64 | 124M | | GPT-2 Medium | 24 | 1024 | 16 | 64 | 350M | | GPT-2 Large | 36 | 1280 | 20 | 64 | 774M | | GPT-3 1.3B | 24 | 2048 | 32 | 64 | 1.3B | | GPT-3 6.7B | 32 | 4096 | 32 | 128 | 6.7B | | GPT-3 175B | 96 | 12288 | 96 | 128 | 175B | | LLaMA 7B | 32 | 4096 | 32 | 128 | 7B | | LLaMA 13B | 40 | 5120 | 40 | 128 | 13B | | LLaMA 70B | 80 | 8192 | 64 | 128 | 70B | ## Parameter count formula For our model with SwiGLU and weight tying: ``` Embedding: vocab_size × d_model Per block QKV: 3 × d_model × d_model Per block Output: d_model × d_model Per block SwiGLU: 3 × d_model × (4 × d_model) Per block Norms: 2 × d_model LM Head: 0 (weight tied with embedding) ``` For GPT-2 Small equivalent (768 dims, 12 layers, 50257 vocab): ``` 152M total = 38.6M (embedding) + 113.3M (12 blocks) + 768 (norm) ``` ## Training hyperparameters | Parameter | Our Default | Range | Notes | |---|---|---|---| | Learning rate | 3e-4 | 1e-5 to 5e-4 | Lower for fine-tuning (1e-5 to 5e-5) | | Weight decay | 0.1 | 0.01 to 0.3 | Only on 2D+ params. Not norms/biases | | Betas | (0.9, 0.95) | : | LLaMA defaults. Don't change | | Epsilon | 1e-8 | : | Never needs tuning | | Warmup steps | 2000 | 500 to 5000 | ~5% of total steps | | Max steps | 100K | Depends on data | More data = more steps | | Batch size | 8 (× 4 accum) | As large as fits | Effective batch = 32 | | Grad clip | 1.0 | 0.5 to 2.0 | 1.0 is standard | | Dropout | 0.1 | 0.0 to 0.3 | Higher = more regularization | ## Sampling parameters | Parameter | Default | Use Case | |---|---|---| | temperature | 0.8 | General. Lower = focused (0.3-0.5), higher = creative (1.2-1.5) | | top_k | 50 | Eliminate nonsense tokens. 0 = disabled | | top_p | 0.9 | Adaptive cutoff. 1.0 = disabled | | max_new_tokens | 100 | Depends on task. Longer for stories, shorter for answers | ## Key formulas ### Embedding ``` output = self.embed(token_ids) No scaling with RoPE (LLaMA convention) ``` ### RoPE rotation angles ``` θ_i = p / (10000^(2i / d_head)) cos_cached = cos(θ_i) for all p and i sin_cached = sin(θ_i) for all p and i x_rotated = x * cos + rotate_half(x) * sin ``` ### Attention ``` Q = input @ W_q [batch, heads, seq, head_dim] K = input @ W_k [batch, heads, seq, head_dim] V = input @ W_v [batch, heads, seq, head_dim] Q_rot = RoPE(Q, seq_len) K_rot = RoPE(K, seq_len) scores = Q_rot @ K_rot^T / sqrt(head_dim) [batch, heads, seq, seq] scores = masks(scores) (causal: upper triangle = -inf) weights = softmax(scores, dim=-1) (row sum = 1.0) output = weights @ V [batch, heads, seq, head_dim] concat heads → [batch, seq, d_model] output = concat @ W_o ``` ### RMSNorm ``` rms = sqrt(mean(x^2) + eps) output = x / rms * weight ``` ### SwiGLU FFN ``` h = SiLU(input @ W1) [batch, seq, 4×d_model] g = input @ W2 [batch, seq, 4×d_model] output = (h * g) @ W3 [batch, seq, d_model] ``` ### Cross entropy loss ``` loss = -log(P_model(true_token)) For random model: loss ≈ ln(vocab_size) = ln(50257) ≈ 10.82 ``` ### AdamW update ``` momentum = β1 × momentum + (1 - β1) × gradient velocity = β2 × velocity + (1 - β2) × gradient^2 m_hat = momentum / (1 - β1^step) v_hat = velocity / (1 - β2^step) weight = weight * (1 - lr * weight_decay) (decoupled) weight = weight - lr * m_hat / (sqrt(v_hat) + ε) ``` ### Cosine warmup schedule ``` if step < warmup: lr = max_lr * step / warmup else: progress = (step - warmup) / (total - warmup) lr = min_lr + (max_lr - min_lr) * 0.5 * (1 + cos(π * progress)) ``` ### Gradient clipping ``` total_norm = sqrt(Σ grad_i^2) if total_norm > max_norm: scale = max_norm / total_norm grad_i *= scale for all i ``` ## Memory requirements (approximate) | Model Size | bfloat16 Weights | Optimizer (AdamW) | Total (no batch) | |---|---|---|---| | 17M (tiny) | 34 MB | 136 MB | 170 MB | | 152M (GPT-2 small) | 304 MB | 1.2 GB | 1.5 GB | | 7B (LLaMA) | 14 GB | 56 GB | 70 GB | | 70B (LLaMA) | 140 GB | 560 GB | 700 GB | Optimizer states = 2 × params × 4 bytes (float32). Gradients = params × 4 bytes. Total training memory ≈ 2×weights + 2×optimizer + gradients + activations. With bfloat16 weights: ~8 bytes per parameter for full training. ## Shape conventions | Component | Input Shape | Output Shape | |---|---|---| | Tokenizer | text | [batch, seq] | | Embedding | [batch, seq] | [batch, seq, d_model] | | RoPE | [batch, heads, seq, head_dim] | Same | | Attention (internal Q/K/V) | [batch, heads, seq, head_dim] | Same | | Attention (external) | [batch, seq, d_model] | [batch, seq, d_model] | | RMSNorm | [batch, seq, d_model] | Same | | SwiGLU | [batch, seq, d_model] | Same | | TransformerBlock | [batch, seq, d_model] | Same | | LM Head | [batch, seq, d_model] | [batch, seq, vocab_size] | | Loss | logits + targets | scalar | ## What a good loss looks like ``` 10.8: Random. Model knows nothing 9.0: Starting to learn word frequencies 7.0: Learning basic grammar 5.0: Coherent phrases emerging 3.0: Decent sentences. Still makes mistakes 2.0: Good model. Plausible text 1.5: Very good. Near production quality <1.0: Overfitting or memorization (on small datasets) ``` ## Dataset sizes | Dataset | Documents | Tokens | Size | |---|---|---|---| | WikiText-103 | 28,475 | 103M | 516 MB | | BookCorpus | ~11,000 | 985M | 5 GB | | C4 | 364M | 156B | 305 GB | | The Pile | : | 825B | 825 GB | ## Common error messages | Error | Meaning | |---|---| | `CUDA out of memory` | Batch too big or model too big. Reduce batch size or use gradient accumulation | | `nan in loss` | Learning rate too high or gradients exploded. Lower LR or add gradient clipping | | `loss = 10.82` | Model is random. Normal at step 0. If it stays there check optimizer and loss function | | `size mismatch` | Shape error. Check batch/seq/head dimensions. Common bug in reshape/permute | | `weights_only load failed` | PyTorch 2.6+ requires `weights_only=False` for checkpoints with custom classes | ## Key files in the repo ``` 📦 how-to-train-your-gpt/ ├── main.py ← Single file training. Run this. ├── requirements.txt ← Dependencies ├── chapters/ ← 12 chapter textbook ├── notebooks/ ← 8 chapter notebooks + attention viz + colab ├── fine-tuning/ ← 7 files on fine-tuning + LoRA notebook ├── explanations and examples WIP/ ← 18 topic deep dives │ ├── attention.md ← Most popular. 363 lines, worked example │ ├── the_complete_story.md ← 1166 lines. Everything connected │ ├── a_tokens_journey.md ← 301 lines. Follow one sentence │ └── ... (15 more topics) └── README.md ```