Byte Pair Encoding — NLP Journey

Every model up to now had to commit to a unit and accept its limits. Characters are universal but meaningless; words carry meaning but explode into a huge vocabulary with an endless tail of rare and never-seen forms. Byte Pair Encoding refuses the choice and learns its units from the data, building subword pieces that sit between the two extremes.

It started in 1994 as a data-compression trick — repeatedly replace the most common adjacent pair of bytes with a new symbol — and was repurposed for NLP tokenization in 2016. Begin by splitting every word into characters, with an end-of-word marker </w> appended. Then repeat one move: count every adjacent pair of symbols across the corpus (weighted by word frequency), merge the single most frequent pair into a new symbol everywhere, and record that as an ordered rule. The most common letter pairs fuse first, then those fuse into longer chunks, and gradually common word-endings and whole words emerge as single tokens.

The learned model is just that ordered list of merge rules. Applying them in order is exactly how text gets tokenized at inference time — and because any unfamiliar word can always fall back to smaller pieces and ultimately single characters, the out-of-vocabulary problem disappears.

"fairest" · characters, 0 merges8 tokens

f a i r e s t </w>

after 50 merges5 tokens

f a i re st</w>

after 150 merges (settles here)4 tokens

fa i re st</w>

after 300 merges (final)4 tokens

fa i re st</w>

A sample of merge rules learned from the sonnets (rule number, of 300 total)

#1 e </w>→e</w> (3471)

#2 t h→th (2824)

#7 i n→in (1299)

#9 o u→ou (984)

#17 an d</w>→and</w> (530)

Real output for "fairest" from a 300-merge run on Shakespeare's 154 sonnets (17,608 tokens, 3,170 unique words). Orange boxes are pieces built by merges. The endings "re" and "st</w>" form first, then the opening "fa" fuses, and the word settles at four tokens — between the 8-token character spelling and the single word-level token. Rule #17 above shows the whole word "and" falling out as one token, for free, once "an" and "d</w>" already exist (it's the 17th merge overall, not the 5th — this list skips the rules in between to keep it short).

Try it

Where it falls short

Merges are chosen by frequency, not meaning. The split fa · i · re · st</w> is statistically sensible but linguistically arbitrary — the real morphemes are fair + -est, and BPE recovers neither cleanly. Whether a tidy suffix like "est" or "ing" emerges as its own token is an accident of corpus frequency, not grammar.

The vocabulary is frozen after training. The merge list is fixed once learned. A domain shift — new jargon, another language, emoji — is handled only by shredding the unfamiliar text back into short pieces or single characters.

It depends on a pre-tokenization convention. BPE here runs inside words, on top of the shared whitespace/punctuation tokenizer and the </w> marker. Change how whitespace, casing, or punctuation are handled and you get different tokens. The units are not canonical.

It still carries no semantics. BPE decides what the units are, never what they mean. "fa", "re", and "st</w>" are opaque, ID-bearing strings. Turning these tokens into meaning is the job of the neural models that come next.

How it works

Try it

Where it falls short