# Chapter 4 — Positional Encoding: Teaching Order ## The 5-Year-Old Analogy Consider two sentences: - "The **dog** bit the **man**." — scary - "The **man** bit the **dog**." — weird Same words, different order -> **completely different meaning**. But the Transformer reads all words **at once** (not one by one like humans do). It has **no idea** which word comes first! So we must **stamp each word with its position** before feeding it to the model. ## The Three Generations of Position Encoding | Method | How It Works | Pros | Cons | Used By | |---|---|---|---|---| | **Learned** | Each position gets its own learned vector | Simple, flexible | Can't handle sequences longer than training | GPT-2, BERT | | **Sinusoidal** | Fixed sine/cosine waves by position | Works for any length | Weaker at relative positions | Original Transformer | | **RoPE** | Rotates Q,K vectors by position angle | Perfect relative positions, any length | Slightly more complex | LLaMA, Mistral, Qwen, Gemma | | **ALiBi** | Adds a bias to attention scores based on distance | No learned params, very fast | Less expressive | BLOOM, MPT | ## Modern Approach: Rotary Position Embeddings (RoPE) Instead of **adding** position numbers to embeddings, RoPE **rotates** the query and key vectors by an angle that depends on position. ### The Math Intuition In 2D, rotating a vector `(x, y)` by angle `θ` gives: ``` x' = x*cos(θ) - y*sin(θ) y' = x*sin(θ) + y*cos(θ) ``` RoPE does this for EVERY pair of dimensions in the query and key vectors. The rotation angle for position `p` and dimension pair `2i, 2i+1` is: ``` θ(p, i) = p / (10000^(2i/d_model)) ``` **Key insight:** The angle depends on `p` (position) and `i` (dimension pair index). Lower dimension pairs rotate FAST (capturing local word relationships). Higher pairs rotate SLOW (capturing long-range relationships). ### Numerical Worked Example Let's trace RoPE with a tiny model: `d_model=4`, processing position `p=1`: **Step 1: Compute frequencies for each dimension pair** ``` Pair 0 (dims 0,1): freq = 1 / 10000^(0/4) = 1 / 1 = 1.000 Pair 1 (dims 2,3): freq = 1 / 10000^(2/4) = 1 / 10000^0.5 = 1 / 100 = 0.010 ``` **Step 2: Compute rotation angle for position p=1** ``` Pair 0 angle: θ₀ = p * freq₀ = 1 * 1.000 = 1.000 radian (≈ 57.3°) Pair 1 angle: θ₁ = p * freq₁ = 1 * 0.010 = 0.010 radian (≈ 0.57°) ``` **Step 3: Apply rotation to a query vector at position 1** ``` Before RoPE: q₁ = [0.8, 0.3, -0.5, 0.2] Rotate pair 0 (dims 0,1) by 57.3°: dim0' = 0.8*cos(1.0) - 0.3*sin(1.0) = 0.8*0.540 - 0.3*0.842 = 0.432 - 0.253 = 0.179 dim1' = 0.8*sin(1.0) + 0.3*cos(1.0) = 0.8*0.842 + 0.3*0.540 = 0.674 + 0.162 = 0.836 Rotate pair 1 (dims 2,3) by 0.57°: dim2' = -0.5*cos(0.01) - 0.2*sin(0.01) = -0.5*1.000 - 0.2*0.010 = -0.500 - 0.002 = -0.502 dim3' = -0.5*sin(0.01) + 0.2*cos(0.01) = -0.5*0.010 + 0.2*1.000 = -0.005 + 0.200 = 0.195 After RoPE: q₁' = [0.179, 0.836, -0.502, 0.195] ``` Now let's compute what happens at positions 1 and 3: ``` Position 1: θ₀ = 1.0 rad, θ₁ = 0.01 rad Position 3: θ₀ = 3.0 rad, θ₁ = 0.03 rad The dot product q₁ · k₃ will depend on the DIFFERENCE: Δθ₀ = 3.0 - 1.0 = 2.0 rad Δθ₁ = 0.03 - 0.01 = 0.02 rad This difference depends ONLY on (3-1)=2, the relative distance! Absolute positions don't matter — only how far apart they are. ``` This is why RoPE is brilliant: the attention score between position `i` and `j` depends **only** on their relative distance `(j-i)`, not their absolute positions. ### Why theta=10000? The base frequency `theta = 10000` controls the "spread" of frequencies: ``` Low theta (e.g., 100): - All dimension pairs rotate similarly - Model is more "position-agnostic" — better for long contexts - But loses fine-grained position resolution High theta (e.g., 100000): - Very different rotation speeds across dimensions - Better at distinguishing nearby positions - But struggles with very long contexts 10000 was found empirically to balance these tradeoffs. ``` ### Extending Context Beyond Training Length What if we trained on 2048 tokens but want to use 4096 at inference? **The problem:** RoPE was precomputed for positions 0-2047. Position 3000 was never seen. **Solutions:** | Method | How It Works | Quality | |---|---|---| | **Linear interpolation** | Position / scale (e.g., p/2 for 2x length) | OK, loses resolution | | **NTK-aware scaling** | Scale theta differently per frequency | Good | | **YaRN** | NTK + temperature scaling | Best (used in production) | | **Retrain** | Just train on longer sequences | Perfect but expensive | For our small training run, this doesn't matter — but know that it's a hot research area for production models. ## RoPE Code — Annotated ```python import torch import torch.nn as nn import math class RotaryPositionalEmbedding(nn.Module): """ WHAT: Rotary Position Embeddings (RoPE). WHY: Instead of ADDING position info to embeddings, we ROTATE Q and K vectors by position-dependent angles. The dot product q_i · k_j then depends ONLY on (j-i), which is exactly what attention should care about. Paper: "RoFormer" (Su et al., 2021) Used in: LLaMA 1/2/3, Mistral, Mixtral, Qwen 1/2, Gemma How it works at a glance: 1. For each pair of dimensions (0,1), (2,3), (4,5), ... 2. Rotate by angle = position * frequency 3. Lower dims rotate fast (local position) Higher dims rotate slow (global position) 4. The dot product naturally depends on relative distance """ def __init__(self, d_model: int, max_seq_len: int = 2048, theta: float = 10000.0): """ WHAT: Precompute rotation frequencies for fast lookup. Args: d_model: Head dimension (e.g., 64 for GPT-2). Must be even. max_seq_len: Precompute angles for positions 0..max_seq_len-1. theta: Base frequency. 10000 is standard. Controls the spread between fast and slow rotation frequencies. """ super().__init__() # WHAT: Verify d_model is even (must have pairs to rotate) assert d_model % 2 == 0, ( f"d_model ({d_model}) must be even for RoPE. " f"Each pair of dimensions needs a partner to rotate with." ) # WHAT: Create dimension indices: [0, 2, 4, ..., d_model-2] # WHY: Each pair (2i, 2i+1) gets the same rotation frequency. # We only need half the indices because pairs share. dim_indices = torch.arange(0, d_model, 2).float() # WHAT: Compute rotation frequencies # WHY: theta_i = 1 / (theta ^ (2i / d_model)) # # i=0: 1 / 10000^(0/64) = 1.0 → fast rotation (local) # i=30: 1 / 10000^(60/64) ≈ 0.0001 → slow rotation (global) # # This multi-scale approach means some dimensions # capture local word order while others capture # long-range position relationships. inv_freq = 1.0 / (theta ** (dim_indices / d_model)) # WHAT: Precompute angles for all positions # WHY: Computing cos/sin during training is expensive. # Precomputing them once and caching is 100x faster. positions = torch.arange(max_seq_len).float() # [0, 1, 2, ..., 2047] # WHAT: Outer product: each position x each frequency # freqs[p, i] = p * inv_freq[i] = angle for position p, dim pair i # Shape: [max_seq_len, d_model/2] freqs = torch.outer(positions, inv_freq) # WHAT: Duplicate to full dimension # WHY: Each dim pair (2i, 2i+1) gets the same angle, # so we copy each angle: [θ0, θ1, θ2, ...] -> [θ0, θ0, θ1, θ1, ...] emb = freqs.repeat_interleave(2, dim=-1) # [max_seq_len, d_model] # WHAT: Cache cos and sin for all positions # WHY: register_buffer means these move with model.to(device) # and are saved with model.state_dict(), but are NOT # trainable parameters (no gradients needed). self.register_buffer("cos_cached", emb.cos()) # cos of each angle self.register_buffer("sin_cached", emb.sin()) # sin of each angle @staticmethod def rotate_half(x: torch.Tensor) -> torch.Tensor: """ WHAT: Prepare a vector for the rotation formula. WHY: The rotation formula is: x' = x*cos + rotate_half(x)*sin For vector [x0, x1, x2, x3, x4, x5]: rotate_half returns [-x1, x0, -x3, x2, -x5, x4] Why this works: For pair (x0, x1) rotated by angle θ: x0' = x0*cos(θ) - x1*sin(θ) ← matches: x0*cos + (-x1)*sin x1' = x0*sin(θ) + x1*cos(θ) ← matches: x1*cos + (x0)*sin So executing (x*cos + rotate_half(x)*sin) performs rotation on every dimension pair simultaneously — no loop needed! """ x1 = x[..., : x.shape[-1] // 2] # First half: [x0, x2, x4, ...] x2 = x[..., x.shape[-1] // 2 :] # Second half: [x1, x3, x5, ...] return torch.cat([-x2, x1], dim=-1) # [-x1, x0, -x3, x2, -x5, x4, ...] def forward(self, x: torch.Tensor, seq_len: int) -> torch.Tensor: """ WHAT: Apply RoPE to queries or keys. Input: [batch, num_heads, seq_len, head_dim] x can be either Q or K (NOT V — values don't need position) Output: Same shape, rotated by position-dependent angles WHY applied only to Q and K: The attention score = Q_i · K_j controls WHICH values to attend to. We want this score to depend on relative position. The VALUE vectors carry content — position is irrelevant for the content itself. Position only matters for deciding which tokens to pay attention TO. """ # WHAT: Extract cos and sin for current sequence length # WHY: If seq_len=512 but max_seq_len=2048, we only need # the first 512 rows of the cached cos/sin tables. cos = self.cos_cached[:seq_len] # [seq_len, head_dim] sin = self.sin_cached[:seq_len] # [seq_len, head_dim] # WHAT: Add batch and head dimensions for broadcasting # WHY: cos/sin are [seq_len, head_dim]. We need them to # multiply with x [batch, heads, seq_len, head_dim]. # unsqueeze(0).unsqueeze(0) adds dims at positions 0 and 1: # [seq_len, head_dim] -> [1, 1, seq_len, head_dim] # Now they broadcast correctly over batch and heads. cos = cos.unsqueeze(0).unsqueeze(0) sin = sin.unsqueeze(0).unsqueeze(0) # WHAT: Execute rotation: x_rotated = x*cos(θ) + rotate_half(x)*sin(θ) # WHY: This is mathematically equivalent to applying a 2D rotation # matrix to each pair of dimensions, but implemented in pure # element-wise operations — much faster and parallelizable. return (x * cos) + (self.rotate_half(x) * sin) ``` --- **Previous:** [Chapter 3 — Embeddings](03_embeddings.md) **Next:** [Chapter 5 — Attention](05_attention.md)