1948 · Sequence Modelling

N-Gram Markov Chain

Use a window of previous words as context, not just the last one.

How it works

The simple Markov chain suffers from amnesia — it forgets everything but the immediately preceding word. N-grams fix this by using the previous N−1 words as a joint key. Instead of asking "what follows from?", a trigram model asks "what follows from fairest?" — a much more specific question with a much more constrained answer.

This, like every page in the Markov family, is a variation on Claude Shannon's 1948 n-gram idea (see Markov Chain), not a later invention — the five pages are ordered by how much each adds on top of the last, not by the year each variant happened to be written down.

Building the model is the same as before: one pass through the corpus, recording for every context window which word follows it. The difference is that the key is now a phrase rather than a single word, and the lookup table grows exponentially with N.

Unigram context "from"
fairest the thee thyself highmost his heat youth mine far sullen me
Bigram context "from fairest"
creatures

One more word of context collapses 44 possible continuations (12 shown) down to 1. Longer context means more coherent output — but far fewer paths through the model.

This tradeoff is the central tension of n-gram models. A bigram generates varied, sometimes incoherent text. A 5-gram generates text that is locally perfect but mechanically reproduces phrases from the source. Somewhere in between is the sweet spot — and where that is depends entirely on how much text you have to train on.

Try it

Generate with n-gram context
Loading the corpus…

Raise N and the text tracks Shakespeare more closely — until it simply reproduces him, because each longer context has fewer (often only one) recorded continuations.

Where it falls short

Data sparsity. As N grows, most conceivable context windows never appear in the training corpus at all. A model trained on 154 sonnets will have seen almost no 5-gram contexts — the table is mostly empty, and generation constantly falls back to random restarts.

Fixed window. The context is always exactly N−1 words, no more. A pronoun referring to a noun 20 words earlier is completely invisible. The model cannot resolve long-range dependencies of any kind.

Still uniform within context. If "from fairest" → "creatures" appears once and "from fairest" → "things" appears ten times, they are still picked with equal probability. Frequency within the context is not yet used.

No generalisation. If the model has never seen a particular context, it has nothing to say. There is no mechanism to infer that an unseen phrase is similar to a seen one.