RoPE embeds relative positions by rotating query/key feature pairs, with phases that grow linearly in position and frequencies that fall off with dimension index. Concretely, the 2D feature pair (x_{2i}, x_{2i+1})
is rotated by the angle θ(i, pos) = pos / base^(2i/d)
. After rotation, the dot product between a query and a key depends only on their relative position, which is what allows careful scaling to extrapolate beyond the trained context.
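To make the rotation concrete, here is a minimal C sketch applying it to one feature pair; the function name and argument layout are illustrative, not taken from any particular implementation:

// Rotate one feature pair (x_{2i}, x_{2i+1}) by theta(i, pos).
// Hypothetical helper, shown only to make the rotation concrete.
#include <math.h>
void rope_rotate_pair(float *x0, float *x1, int pos, int i, int d) {
    float theta = pos / powf(10000.0f, (2.0f * i) / d);  // phase grows with pos
    float c = cosf(theta), s = sinf(theta);
    float a = *x0, b = *x1;
    *x0 = a * c - b * s;  // standard 2D rotation
    *x1 = a * s + b * c;
}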
Context Extension in Practice
Position interpolation scales the effective position to pos' = pos * α with α < 1
, compressing long sequences into the trained position range. NTK‑aware scaling instead enlarges the RoPE base so that high‑frequency dimensions are perturbed less, preserving the kernel geometry at large contexts. YaRN goes further, interpolating per frequency band (compressing low‑frequency dimensions while leaving high‑frequency ones nearly intact) and adding an attention temperature adjustment to stabilize very long contexts. Always re‑tune sampling (temperature, top‑p) after context extension, since logit entropy drifts when the positional kernel changes.
// RoPE position interpolation (runnable C sketch)
#include <math.h>
float rope_angle(int pos, int i, int d) {
    const float base  = 10000.0f;
    const float scale = 0.5f;                  // compress positions 2x
    float freq = powf(base, (2.0f * i) / d);   // float division avoids truncation
    return pos * scale / freq;                 // scaled angle for pair (x_{2i}, x_{2i+1})
}
// apply a 2D rotation by this angle to each q/k pair before attention
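For contrast, a hedged sketch of the NTK‑aware variant: rather than scaling positions, it enlarges the base. The exponent d/(d-2) below is the commonly cited NTK‑aware form and is assumed here rather than stated in this section:

// NTK-aware base adjustment (sketch). For a context-extension factor s,
// enlarge the base by s^(d/(d-2)); positions stay unscaled.
#include <math.h>
float ntk_base(float base, float s, int d) {
    return base * powf(s, (float)d / (float)(d - 2));
}
// theta = pos / powf(ntk_base(10000.0f, 2.0f, d), (2.0f * i) / d);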
Measure degradation via long‑range tasks (needle‑in‑a‑haystack, long‑QA) and watch per‑head attention entropy; some heads specialize in long‑range retrieval and will saturate first.
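One way to watch for that saturation, sketched below: compute the Shannon entropy of each head's attention distribution and track it as context grows; a value collapsing toward zero flags a head that has saturated. The helper is hypothetical:

// Shannon entropy (nats) of one head's attention row, assumed to sum to 1.
// Track per head across context lengths; saturating heads collapse first.
#include <math.h>
float attn_entropy(const float *p, int n) {
    float h = 0.0f;
    for (int j = 0; j < n; j++)
        if (p[j] > 0.0f) h -= p[j] * logf(p[j]);
    return h;
}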