Speculative decoding uses a small draft model to propose k tokens; the large target model validates them in one pass. Acceptance checks compare draft logits with target logits under the chosen sampler. If the prefix is accepted, we commit all k tokens, amortizing KV writes and matmuls.
Key Details
- Draft quality should slightly underestimate target entropy to avoid frequent rejections.
- Calibrate with temperature/top‑p matched to the target.
- Batch multiple sequences with shared draft steps to keep GPUs saturated.
// acceptance test (sketch)
accepted = true;
for (i=1..k) {
if (!within_tolerance(target.logp[t_i], draft.logp[t_i])) { accepted=false; break; }
}
if (accepted) commit(k); else commit(1);
Measure effective tokens/sec and KV bytes written/sec. Gains vanish if memory traffic dominates or if sampler constraints (e.g., JSON mode) reduce overlap.