# RoPE — Rotary Position Embeddings

## What is it

RoPE is a way to tell a language model *where* each word sits in
a sentence. Without it the model sees all words at once and has
no idea which word came first. RoPE stamps every word with its
position by giving it a tiny rotation. Words at the start get a
small spin. Words later in the sentence get a bigger spin. The
model can look at how much two words are rotated and figure out
their distance.

## Where is it used

RoPE lives inside the attention layer. Specifically it is applied
to the query and key vectors right before the dot product that
decides how much two words should pay attention to each other.

```
Input tokens
  → Embedding (word meanings)
    → Attention (where RoPE happens)
      → Transformer Block output
```

## Why use it

Before RoPE people used other tricks to mark word positions. Some
added position numbers to the word vectors. Others let the model
learn position from scratch. Both worked but had limits. Learned
positions could not handle sentences longer than training. Added
position numbers did not capture the relative distance between
words well.

RoPE fixes both problems. It captures relative distance perfectly.
Word five and word seven are always two steps apart no matter if
they appear at the start or the middle of a long paragraph. And
RoPE can handle any sentence length even if the model was trained
on shorter ones. This is why LLaMA Mistral and Qwen all use RoPE.

## When was it invented

RoPE was published in 2021 by a team of researchers in a paper
called RoFormer. It took a few years to catch on but by 2023
every major open source language model had switched to RoPE.

## How it works in simple terms

Imagine a clock with only one hand. At position zero the hand
points straight up. At position one the hand rotates a little.
At position two it rotates a little more. Each position gets a
unique angle. The model stores these angles as cosine and sine
values so it never has to compute them during training.

Now every word has a secret pair of numbers. RoPE takes that pair
and rotates it by the angle for that position. After rotation two
words that are close together will have similar rotations. Two
words far apart will have very different rotations. When attention
looks at the dot product between a query and a key the result
depends on how far apart they are. Not on their absolute position.

## A tiny code example

```python
import torch
import math

# Set up RoPE for a tiny model with 4 dimensions
d_model = 4
max_seq_len = 16
theta = 10000.0

dim_indices = torch.arange(0, d_model, 2).float()
inv_freq = 1.0 / (theta ** (dim_indices / d_model))
positions = torch.arange(max_seq_len).float()
freqs = torch.outer(positions, inv_freq)
emb = freqs.repeat_interleave(2, dim=-1)

cos_cached = emb.cos()
sin_cached = emb.sin()

# Pretend we have a query vector for a word at position 0
q = torch.tensor([0.8, 0.3, -0.5, 0.2])

seq_len = 4
cos = cos_cached[:seq_len]
sin = sin_cached[:seq_len]

# Apply rotation for position 0
rotated = q * cos[0] + torch.tensor([-0.3, 0.8, -0.2, -0.5]) * sin[0]
print(f"Position 0: {rotated.tolist()}")

# Apply rotation for position 2
rotated = q * cos[2] + torch.tensor([-0.3, 0.8, -0.2, -0.5]) * sin[2]
print(f"Position 2: {rotated.tolist()}")

print()
print("Same word at different positions gets different rotations.")
print("The model uses this difference to understand word order.")
```

## What you need to remember

RoPE rotates vectors. The rotation angle depends on position. The
dot product between two rotated vectors depends only on how far
apart they are. This is what attention should care about. Not
where the words are. But how far they are from each other.

RoPE is free. No learned parameters. No extra memory. No speed
penalty. It works for sequences of any length. Every modern
language model uses it.