Tokenizer design is not an implementation detail; it is part of the model. Byte-Pair Encoding (BPE) greedily merges frequent symbol pairs to grow a subword vocabulary; SentencePiece is a toolkit that operates on raw text without pre-tokenization, treating whitespace as an ordinary symbol, and can train either BPE or Unigram LM models; a Unigram LM starts from a large candidate inventory and prunes it to the subword set that maximizes corpus likelihood. Each choice shifts the distribution of token counts per input, directly influencing effective context length, training stability, and downstream latency.
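To make the merge mechanics concrete, here is a minimal, self-contained sketch of a single BPE merge step over a toy word-frequency table; real trainers add pre-tokenization, pair-count caching, and special-token handling, and the toy vocabulary below is purely illustrative.

# One BPE merge step: count adjacent symbol pairs weighted by word frequency,
# then merge the most frequent pair everywhere it occurs.
from collections import Counter

def most_frequent_pair(vocab):
    # vocab maps a tuple of symbols (a word split into pieces) to its corpus count
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(vocab, pair):
    merged = pair[0] + pair[1]
    new_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

vocab = {("l", "o", "w"): 5, ("l", "o", "n", "g"): 2}
best = most_frequent_pair(vocab)   # ('l', 'o') occurs 7 times, the current maximum
vocab = apply_merge(vocab, best)   # {('lo', 'w'): 5, ('lo', 'n', 'g'): 2}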
Byte-level BPE (BBPE) eliminates out-of-vocabulary failures because every string decomposes into bytes, but it inflates average token counts for non-Latin scripts, whose characters occupy two to four UTF-8 bytes each (see the sketch below); Unigram LM typically yields better compression for morphologically rich languages. Whitespace handling (e.g., prepended-space tokens marking word starts) affects boundary detection and merge statistics; small mistakes here degrade perplexity and cause pathological merges.
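A quick way to see the pressure on byte-level vocabularies is to count raw UTF-8 bytes per character across scripts; before any merges are learned, that byte count is exactly the token count a pure byte-level tokenizer starts from. The sample strings below are illustrative.

# Bytes per character by script: the pre-merge token count for a byte-level tokenizer.
samples = {
    "English":  "internationalization",
    "Russian":  "интернационализация",
    "Japanese": "国際化",
}
for lang, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{lang:9s} {len(text):2d} chars -> {n_bytes:2d} bytes "
          f"({n_bytes / len(text):.1f} bytes/char)")
# ASCII stays at 1 byte/char, Cyrillic sits near 2, CJK near 3, so byte-level
# token counts grow unless merges recover whole multi-byte characters.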
Building a Reproducible Tokenizer
Always commit the exact vocabulary and merge list (or the model file, for Unigram) together with a frozen normalization pipeline (NFKC, lowercasing, control-character policy); a minimal version of such a pipeline is sketched below. Train on a representative, deduplicated corpus; otherwise merges overfit to corpus artifacts. Evaluate with tokenizer-aware deduplication, since different tokenizers produce different shingles and hash distributions (see the shingling sketch after the training example).
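The normalization and control-character policy matter because they change the symbol statistics the trainer sees. Here is a minimal sketch of such a frozen pipeline using only the standard library, applied to the corpus before the training step below; the specific policy choices (keep tab and newline, drop other control characters) are assumptions to pin down per project.

# Frozen normalization: NFKC, optional lowercasing, and an explicit
# control-character policy, applied identically at training and inference time.
import unicodedata

def normalize(text: str, lowercase: bool = False) -> str:
    text = unicodedata.normalize("NFKC", text)
    if lowercase:
        text = text.lower()
    # Drop control characters (Unicode category "Cc") except tab and newline.
    return "".join(ch for ch in text
                   if unicodedata.category(ch) != "Cc" or ch in "\t\n")

assert normalize("ﬁx\u0007 this") == "fix this"   # NFKC expands the "ﬁ" ligature; BEL is dropped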
# Unigram training sketch, mapped onto SentencePiece's trainer
# input: deduplicated, normalized text; target vocab: 50k; character coverage: 0.9995
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.dedup.txt",        # illustrative path to the deduplicated corpus
    model_prefix="unigram50k",
    model_type="unigram",
    vocab_size=50_000,
    character_coverage=0.9995,       # characters outside coverage rely on byte fallback
    byte_fallback=True,
    normalization_rule_name="nfkc",  # control-character stripping handled upstream (see above)
    shrinking_factor=0.75,           # fraction of candidate pieces kept per EM pruning round (library default)
    unk_id=0, bos_id=1, eos_id=2, pad_id=3,
    unk_piece="<unk>", bos_piece="<bos>", eos_piece="<eos>", pad_piece="<pad>",
)
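Tokenizer-aware deduplication deserves a concrete picture: shingles built from token IDs change whenever the tokenizer changes, so near-duplicate signatures computed under one tokenizer do not transfer to another. A minimal sketch, assuming an encode callable that maps text to a list of token IDs; the function names and shingle width are illustrative.

# Token-level shingles for deduplication: the shingle set, and hence any
# hash-based near-duplicate signature, depends on the tokenizer in use.
import hashlib

def shingle_hashes(encode, text, k=5):
    # Hash every k-token window of the encoded text.
    ids = encode(text)
    return {
        hashlib.sha1(" ".join(map(str, ids[i:i + k])).encode()).hexdigest()
        for i in range(max(len(ids) - k + 1, 1))
    }

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

# Two documents flagged as near-duplicates under tokenizer A can score lower
# under tokenizer B, so dedup thresholds must be re-tuned per tokenizer.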
Benchmark the token-length distribution per language, and re-tokenize evaluation sets when comparing models so that counts are not confounded by tokenizer differences. A 6–10% token saving at equal semantic coverage often translates into double-digit throughput gains in production, because both per-step compute and the number of decode steps scale with token count.
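To keep such comparisons honest, measure both tokenizers on the same evaluation texts. A minimal sketch, assuming encode_a and encode_b callables that return token-ID lists; the function names and the per-language grouping are illustrative.

# Compare token-count distributions of two tokenizers on a shared eval set,
# grouped by language, and report the relative saving of B over A.
from collections import defaultdict
from statistics import median

def compare(encode_a, encode_b, docs):
    # docs: iterable of (language, text) pairs
    counts = defaultdict(lambda: {"a": [], "b": []})
    for lang, text in docs:
        counts[lang]["a"].append(len(encode_a(text)))
        counts[lang]["b"].append(len(encode_b(text)))
    for lang, c in counts.items():
        saving = 1.0 - sum(c["b"]) / sum(c["a"])
        print(f"{lang}: median tokens A={median(c['a'])}, B={median(c['b'])}, "
              f"B saves {saving:.1%}")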