Markov Chain — NLP Journey

A Markov chain can be pictured as a directed graph. Every unique word in the corpus becomes a node, and whenever word B follows word A in the text, a directed edge is recorded from A to B (stored once, however many times the pair occurs). To generate text, start at any word and at each step pick a random follower from the edges leaving the current node.
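The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this page: the function name `build_graph`, the lowercase-and-split tokenisation, and the sample sentence are all assumptions made for the example. It stores each distinct follower once, which matches the uniform-selection behaviour discussed further down.

```python
from collections import defaultdict

def build_graph(text):
    """Map each word to the set of distinct words that follow it."""
    # Assumption: tokens are lowercase, whitespace-separated words.
    words = text.lower().split()
    followers = defaultdict(set)
    # One pass over adjacent pairs: (word, next word).
    for current, nxt in zip(words, words[1:]):
        followers[current].add(nxt)  # a set keeps each follower once
    return followers

# Hypothetical sample text, not the real corpus.
graph = build_graph("from fairest creatures we desire increase from thee")
print(sorted(graph["from"]))  # ['fairest', 'thee']
```

Using a set rather than a list is what discards the counts: a pair that occurs ten times leaves the same single edge as a pair that occurs once.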

This is the first of five variations on the same idea that you'll meet. Andrei Markov analysed letter sequences in 1913, and Claude Shannon's 1948 paper turned the idea into a language model of words. The other stops in this family, including n-grams, probability weighting, and POS-tagging, are all elaborations of that one 1948 idea, ordered here by how much they add, not by the year each variant happened to be written down.

The governing assumption is the Markov property: the next word depends only on the current word. Nothing before it matters. This makes the model trivially cheap to build — one pass through the text, one lookup table — and it already produces output that feels faintly language-like, because real language does have local regularities.
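To make the Markov property concrete, here is a hedged sketch of the generation loop. The `followers` table is a hand-written toy, and `generate` and its parameters are names chosen for this example. The point to notice is that the loop consults only `word`, the current state, and never looks at the text produced so far.

```python
import random

# Toy lookup table (an assumption for illustration, not the real corpus).
followers = {
    "from": {"thee", "fairest"},
    "thee": {"from"},
    "fairest": {"creatures"},
    "creatures": set(),  # dead end: nothing ever followed this word
}

def generate(followers, start, length, rng=random):
    """Random walk of at most `length` words, starting at `start`."""
    word = start
    output = [word]
    for _ in range(length - 1):
        options = followers.get(word)
        if not options:
            break  # no outgoing edges, so the walk stops early
        # Uniform pick among distinct followers; only `word` is consulted.
        word = rng.choice(sorted(options))
        output.append(word)
    return output

print(" ".join(generate(followers, "from", 6)))
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which helps when comparing outputs.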

"from"

→ fairest, the, thee, thyself, highmost, his, heat, youth, mine, far, sullen, me

Every word that ever follows "from" in the sonnets is a valid next step: 44 distinct words in all (12 shown). Each is selected with equal probability, 1 in 44, whether it followed "from" ten times in the corpus or once. The model has no notion of which follower is more natural.

The result is surprising and sometimes beautiful because Shakespeare's word transitions are themselves non-arbitrary. Even a uniform random walk through them produces something that sounds Shakespearean in tone, if not in sense.

Where it falls short

No memory. After choosing "fairest" as the word following "from", the model immediately forgets "from". The next word is chosen knowing only "fairest" — the full phrase built so far is invisible.

Uniform selection. "From thee" appears ten times in the corpus and "from fairest" once, yet the model treats them identically: each distinct follower is stored once and gets an equal share of the draw. How often a transition occurred is thrown away entirely.

No semantics. The model cannot tell the difference between a word that makes sense in context and one that does not. It only knows adjacency — which words have appeared next to which, never why.

One output mode. Markov chains can only generate. They cannot search, classify, summarise, or answer any question about what the text means.
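The uniform-selection point above can be checked with a little arithmetic. The sketch below uses a toy list of pairs with the two counts quoted in the text (ten for "from thee", one for "from fairest") and ignores the other followers, so the numbers illustrate the effect and are not the real 1-in-44 figures. The frequency-weighted column is shown only for contrast; it is not what this model does.

```python
from collections import Counter

# Toy data: only two followers of "from", with the counts quoted above.
pairs = [("from", "thee")] * 10 + [("from", "fairest")]

counts = Counter(nxt for cur, nxt in pairs if cur == "from")
distinct = sorted(counts)

# What this model does: every distinct follower gets an equal share.
uniform = {w: 1 / len(distinct) for w in distinct}

# What it throws away: shares proportional to how often each pair occurred.
total = sum(counts.values())
weighted = {w: counts[w] / total for w in distinct}

print(uniform)   # {'fairest': 0.5, 'thee': 0.5}
print(weighted)  # 'thee' gets 10/11, 'fairest' gets 1/11
```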
