During prefill, we compute keys and values for all prompt tokens in parallel; during decode, we append one token's worth of K/V per step. The layout of the cached tensors, a contiguous [batch, heads, seq, head_dim] buffer versus paged blocks, largely determines memory bandwidth utilization and fragmentation.
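As a point of reference, a minimal sketch of the contiguous layout, with illustrative sizes and a hypothetical kv_index helper (not from any particular framework):

#include <cstddef>

// Flat offset into a contiguous [batch, heads, max_seq, head_dim] KV buffer.
inline size_t kv_index(int b, int h, int t, int d,
                       int heads, int max_seq, int head_dim) {
    return (((size_t)b * heads + h) * max_seq + t) * head_dim + d;
}

// Prefill writes positions [0, prompt_len) for every head in one shot; decode
// later writes exactly one position per step, e.g.
//   K_cache[kv_index(b, h, cur_len, d, heads, max_seq, head_dim)] = k_step;
// so the buffer must be sized for max_seq up front, which paging avoids.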
Paged KV caching stores attention state in fixed-size blocks (e.g., 16–32 tokens) mapped through a per-sequence page table. This enables context reuse and prefix sharing, and it avoids large reallocations as sequence lengths vary. Align block sizes to tensor-core tile sizes so attention kernels can consume whole tiles without padding or shared-memory bank conflicts.
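A concrete sketch of the mapping, with illustrative sizes and names (Page, PageTable, pool, and alloc_or_get are assumptions here, not a specific runtime's API): each sequence owns a small table from logical block index to a physical block drawn from a shared pool, so a growing sequence only ever claims one fixed-size block at a time.

#include <algorithm>
#include <numeric>
#include <unordered_map>
#include <vector>

constexpr int BLOCK_TOKENS = 16;   // tokens per block
constexpr int KV_HEADS     = 8;    // illustrative model sizes
constexpr int HEAD_DIM     = 128;

// One physical block: K/V slots for BLOCK_TOKENS positions across all kv heads.
struct Page {
    float K[KV_HEADS][BLOCK_TOKENS][HEAD_DIM];
    float V[KV_HEADS][BLOCK_TOKENS][HEAD_DIM];
};

std::vector<Page> pool(256);                    // pre-allocated physical blocks
std::vector<int>  free_list = [] {              // ids of currently unused blocks
    std::vector<int> v(pool.size());
    std::iota(v.begin(), v.end(), 0);
    return v;
}();

// Per-sequence page table: logical block index -> physical block id.
struct PageTable {
    std::vector<int> blocks;

    // Physical page owning absolute token position `pos`, allocated on demand.
    Page* alloc_or_get(int pos) {
        int logical = pos / BLOCK_TOKENS;
        while ((int)blocks.size() <= logical) {
            blocks.push_back(free_list.back());
            free_list.pop_back();
        }
        return &pool[blocks[logical]];
    }
};

std::unordered_map<int, PageTable> tables;      // seq_id -> page table

Prefix sharing falls out of the representation: two sequences that begin with the same prompt can point their leading page-table entries at the same physical blocks, with reference counts deciding when a block returns to the free list.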
Implementation Sketch
// Paged append: route each new token's K/V into the block owning its position
for (int t = 0; t < num_new_tokens; ++t) {
    int   pos    = seq_len + t;                        // absolute position in the sequence
    int   offset = pos % BLOCK_TOKENS;                 // slot within that block
    Page* page   = tables[seq_id].alloc_or_get(pos);   // page-table lookup / on-demand allocation
    for (int h = 0; h < KV_HEADS; ++h) {
        std::copy_n(K_new[t][h], HEAD_DIM, page->K[h][offset]);   // this head's key vector
        std::copy_n(V_new[t][h], HEAD_DIM, page->V[h][offset]);   // this head's value vector
    }
}
// attention does an indirect gather over the page table
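Reusing the Page, pool, and PageTable sketched above, the read side resolves each position through the same table. A host-side reference for a single score (real GPU kernels typically gather a whole block per thread block, so the indirection cost is amortized):

// Dot product of query q with the key cached at absolute position `pos`,
// located indirectly through the sequence's page table.
float score_at(const PageTable& table, int head, int pos, const float* q) {
    const Page*  page = &pool[table.blocks[pos / BLOCK_TOKENS]];  // logical block -> physical page
    const float* k    = page->K[head][pos % BLOCK_TOKENS];        // slot within the page
    float dot = 0.f;
    for (int d = 0; d < HEAD_DIM; ++d) dot += q[d] * k[d];
    return dot;  // caller scales by 1/sqrt(HEAD_DIM) before the softmax
}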
Monitor L2/TEX cache hit rates and DRAM reads per token. The fastest kernels colocate the Q/K/V projections with the attention computation, accumulate scores in shared memory, and fuse the softmax, dropout, and value matmul into a single pass to cut round trips through global memory.
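The fusion is easiest to see in a scalar reference: with the online-softmax recurrence, the score computation, normalization, and the weighted sum over V all happen in one sweep over the cached keys/values, so the full score row is never materialized between separate softmax and matmul passes. A sketch for one query and one head, dropout omitted (plain C++, not a GPU kernel; attend_one_query is a hypothetical name):

#include <algorithm>
#include <cmath>
#include <vector>

// Single-pass attention of query q (length d) over T cached key/value rows.
void attend_one_query(const float* q, const float* K, const float* V,
                      int T, int d, float* out) {
    std::vector<float> acc(d, 0.f);            // running weighted sum of V rows
    float m = -INFINITY, l = 0.f;              // running max and softmax normalizer
    const float scale = 1.f / std::sqrt((float)d);
    for (int t = 0; t < T; ++t) {
        float s = 0.f;                         // score = scale * (q . K[t])
        for (int i = 0; i < d; ++i) s += q[i] * K[t * d + i];
        s *= scale;
        const float m_new = std::max(m, s);
        const float corr  = std::exp(m - m_new);   // rescale earlier contributions
        const float p     = std::exp(s - m_new);   // weight of this position
        l = l * corr + p;
        for (int i = 0; i < d; ++i) acc[i] = acc[i] * corr + p * V[t * d + i];
        m = m_new;
    }
    for (int i = 0; i < d; ++i) out[i] = acc[i] / l;   // final softmax normalization
}

For the measurement side, one option is Nsight Compute, whose Memory Workload Analysis section reports per-kernel L2 hit rate and DRAM traffic; dividing DRAM bytes read by tokens generated gives the reads-per-token figure above.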