Charles Dahab - Personal Blog

Speculative decoding uses a small draft model to propose k tokens; the large target model validates them in one pass. Acceptance checks compare draft logits with target logits under the chosen sampler. If the prefix is accepted, we commit all k tokens, amortizing KV writes and matmuls.

Key Details

Draft quality should slightly underestimate target entropy to avoid frequent rejections.
Calibrate with temperature/top‑p matched to the target.
Batch multiple sequences with shared draft steps to keep GPUs saturated.

// acceptance test (sketch)
accepted = true;
for (i=1..k) {
if (!within_tolerance(target.logp[t_i], draft.logp[t_i])) { accepted=false; break; }
}
if (accepted) commit(k); else commit(1);

Measure effective tokens/sec and KV bytes written/sec. Gains vanish if memory traffic dominates or if sampler constraints (e.g., JSON mode) reduce overlap.

Speculative Decoding: Draft Models, Assisted Generation, and Acceptance Kernels

Cutting latency by verifying multiple candidate tokens per step

Key Details