RoPE embeds relative positions by rotating query/key feature pairs, with phases that grow linearly in position and frequencies that fall off with dimension index. Concretely, the 2D feature pair (x_{2i}, x_{2i+1})
is rotated by the angle θ(i, pos) = pos / base^(2i/d)
. After rotation, the dot product between a query and a key depends only on their relative position, which is what allows careful scaling to extrapolate beyond the trained context.
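To make the rotation concrete, here is a minimal C sketch applying it to one feature pair; the function name and argument layout are illustrative, not taken from any particular implementation:

// Rotate one feature pair (x_{2i}, x_{2i+1}) by theta(i, pos).
// Hypothetical helper, shown only to make the rotation concrete.
#include <math.h>
void rope_rotate_pair(float *x0, float *x1, int pos, int i, int d) {
    float theta = pos / powf(10000.0f, (2.0f * i) / d);  // phase grows with pos
    float c = cosf(theta), s = sinf(theta);
    float a = *x0, b = *x1;
    *x0 = a * c - b * s;  // standard 2D rotation
    *x1 = a * s + b * c;
}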
Context Extension in Practice
Position interpolation scales the effective position to pos' = pos * α with α < 1
, compressing long sequences into the trained position range. NTK‑aware scaling instead enlarges the RoPE base so that high‑frequency dimensions are perturbed less, preserving the kernel geometry at large contexts. YaRN goes further, interpolating per frequency band (compressing low‑frequency dimensions while leaving high‑frequency ones nearly intact) and adding an attention temperature adjustment to stabilize very long contexts. Always re‑tune sampling (temperature, top‑p) after context extension, since logit entropy drifts when the positional kernel changes.
// RoPE position interpolation (runnable C sketch)
#include <math.h>
float rope_angle(int pos, int i, int d) {
    const float base  = 10000.0f;
    const float scale = 0.5f;                  // compress positions 2x
    float freq = powf(base, (2.0f * i) / d);   // float division avoids truncation
    return pos * scale / freq;                 // scaled angle for pair (x_{2i}, x_{2i+1})
}
// apply a 2D rotation by this angle to each q/k pair before attention
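For contrast, a hedged sketch of the NTK‑aware variant: rather than scaling positions, it enlarges the base. The exponent d/(d-2) below is the commonly cited NTK‑aware form and is assumed here rather than stated in this section:

// NTK-aware base adjustment (sketch). For a context-extension factor s,
// enlarge the base by s^(d/(d-2)); positions stay unscaled.
#include <math.h>
float ntk_base(float base, float s, int d) {
    return base * powf(s, (float)d / (float)(d - 2));
}
// theta = pos / powf(ntk_base(10000.0f, 2.0f, d), (2.0f * i) / d);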
Measure degradation via long‑range tasks (needle‑in‑a‑haystack, long‑QA) and watch per‑head attention entropy; some heads specialize in long‑range retrieval and will saturate first.
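One way to watch for that saturation, sketched below: compute the Shannon entropy of each head's attention distribution and track it as context grows; a value collapsing toward zero flags a head that has saturated. The helper is hypothetical:

// Shannon entropy (nats) of one head's attention row, assumed to sum to 1.
// Track per head across context lengths; saturating heads collapse first.
#include <math.h>
float attn_entropy(const float *p, int n) {
    float h = 0.0f;
    for (int j = 0; j < n; j++)
        if (p[j] > 0.0f) h -= p[j] * logf(p[j]);
    return h;
}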