Mixture‑of‑Experts (MoE): Routing, Capacity, and Load Balancing at Inference

Sparse activation for dense performance, without the tail latencies

Posted on September 10, 2025

MoE layers replace a single FFN with a pool of expert FFNs and a learned router that gates each token to a small subset of them. Top-1/Top-2 routing keeps per-token FLOPs roughly constant while total parameter count, and with it model capacity, scales with the number of experts. However, when routed token counts exceed per-expert capacity, the surplus tokens are dropped (token-drop events) and overloaded experts become stragglers.
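
A back-of-envelope sketch of that mismatch, assuming one common capacity rule (capacity = capacity_factor * k * n_tokens / n_experts); the token counts, capacity factor, and per-expert routing skew below are illustrative, not measurements from any particular model.

import math

def expert_capacity(n_tokens: int, n_experts: int, k: int, capacity_factor: float) -> int:
    # Common capacity rule: each expert accepts at most this many token slots per step.
    return math.ceil(capacity_factor * k * n_tokens / n_experts)

# Illustrative (assumed) numbers: 4096 tokens per step, 8 experts, top-2 routing.
n_tokens, n_experts, k = 4096, 8, 2
cap = expert_capacity(n_tokens, n_experts, k, capacity_factor=1.25)   # 1280 slots per expert

# Hypothetical per-expert routed counts from a skewed router (sums to k * n_tokens).
routed = [2100, 1500, 1300, 900, 700, 650, 550, 492]
dropped = sum(max(0, c - cap) for c in routed)    # slots beyond capacity are dropped
print(f"capacity per expert = {cap}, dropped token-slots = {dropped}")   # 1280, 1060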

Serving Considerations

Co-schedule experts to minimize cross-GPU traffic. Use load-balancing penalties and auxiliary losses during training to keep routing roughly k-balanced; at inference, enable sticky routing (keep token streams on the same experts across decode steps) to improve cache locality. Profile NCCL all-to-all time against expert GEMM time; if all-to-all exceeds roughly 25% of step time, repartition experts across devices.
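
A minimal sketch of that check, assuming per-step timings have already been collected (for example with torch.profiler or NCCL debug logging); the millisecond values below are hypothetical.

# Hypothetical per-step timings in milliseconds, taken from a profiler trace.
step_ms = {"all_to_all": 3.8, "expert_gemm": 7.1, "attention": 2.6, "other": 0.9}

total_ms = sum(step_ms.values())
a2a_fraction = step_ms["all_to_all"] / total_ms
print(f"all-to-all share of step time: {a2a_fraction:.1%}")   # ~26.4%

if a2a_fraction > 0.25:
    # Heuristic from the text: cut cross-GPU traffic by repartitioning, e.g. fewer
    # experts per device or colocating experts that are frequently co-routed.
    print("all-to-all dominates; repartition experts across GPUs")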

// simplified top-2 gating (per token)
scores = softmax(W_g * h);                    // router distribution over experts
(gates, idx) = topk(scores, k=2);             // keep the two highest-scoring experts
gates = gates / sum(gates);                   // renormalize the kept gates (typical for top-2)
h_out = Σ_{j ∈ idx} gates_j * Expert_j(h);    // weighted sum of the selected experts' outputs
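
For readers who want something runnable, here is a dense-dispatch PyTorch sketch of the same top-2 gating; the module name Top2MoE, the expert widths, and the per-expert Python loop are illustrative choices, and a real serving path would replace the loop with batched dispatch plus all-to-all across expert-parallel ranks.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    # Illustrative top-2 MoE FFN: each token is processed by its 2 highest-scoring experts.
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)   # W_g
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:           # h: [n_tokens, d_model]
        scores = F.softmax(self.router(h), dim=-1)                # [n_tokens, n_experts]
        gates, idx = torch.topk(scores, k=2, dim=-1)              # top-2 per token
        gates = gates / gates.sum(dim=-1, keepdim=True)           # renormalize kept gates
        out = torch.zeros_like(h)
        for e, expert in enumerate(self.experts):                 # loop kept simple for clarity
            for slot in range(2):
                mask = idx[:, slot] == e                          # tokens whose slot chose expert e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(h[mask])
        return out

moe = Top2MoE(d_model=64, d_ff=256, n_experts=8)
print(moe(torch.randn(16, 64)).shape)   # torch.Size([16, 64])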

Measure quality drift before switching from Top-2 to Top-1 at inference; some tasks are robust, while others lose calibration.
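
A cheap routing-level proxy for that risk, as a sketch: measure how much gate mass the second-ranked expert carries per token, since those tokens change the most under Top-1. The router weights and hidden states below are random placeholders; a real evaluation would compare end-task metrics and calibration (e.g. expected calibration error) under both routing modes.

import torch
import torch.nn.functional as F

# Placeholder router weights and hidden states; swap in real activations to use this.
torch.manual_seed(0)
n_tokens, d_model, n_experts = 4096, 64, 8
W_g = torch.randn(d_model, n_experts) / d_model ** 0.5
h = torch.randn(n_tokens, d_model)

scores = F.softmax(h @ W_g, dim=-1)               # router distribution per token
top2, _ = torch.topk(scores, k=2, dim=-1)
second_share = top2[:, 1] / top2.sum(dim=-1)      # gate mass lost if only Top-1 is kept

# Tokens with a large second-expert share are the ones most likely to drift under Top-1.
print(f"mean second-expert share: {second_share.mean():.3f}")
print(f"tokens with >40% of kept mass on expert #2: {(second_share > 0.4).float().mean():.1%}")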