1971 · Sequence Modelling

POS-Tagged Markov Chain

Steer the random walk with grammar, not just adjacency.

How it works

An ordinary Markov chain remembers only which word followed which. A part-of-speech (POS) tagger first labels every word with its grammatical role — noun, verb, adjective, determiner, and so on — and the chain's state becomes the pair (word, role) rather than the bare word. Now "light" the noun and "light" the adjective are different nodes, and the transitions the model learns are as much about grammar as about vocabulary.
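The change of state can be sketched in a few lines of Python. This is a minimal illustration, not the page's own implementation: the hand-tagged token list, the tag names (DET, NOUN, ADJ, VERB) and the helper `build_chain` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hand-tagged toy input (hypothetical); a real run would get tags from a tagger.
tagged = [
    ("the", "DET"), ("light", "NOUN"), ("is", "VERB"),
    ("a", "DET"), ("light", "ADJ"), ("burden", "NOUN"),
]

def build_chain(states):
    """Count how often each state is immediately followed by each other state."""
    chain = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        chain[current][following] += 1
    return chain

chain = build_chain(tagged)

# "light" the noun and "light" the adjective are separate keys,
# so each keeps its own followers.
print(chain[("light", "NOUN")])  # Counter({('is', 'VERB'): 1})
print(chain[("light", "ADJ")])   # Counter({('burden', 'NOUN'): 1})
```

Keying the table on the tuple rather than the bare string is the whole trick: the counting code is the same as for a plain chain.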

This is the last and most elaborate of the Markov family's five variations on Claude Shannon's 1948 n-gram idea (see Markov Chain); the pages are ordered by how much each adds, not by year. The 1971 date above marks a different lineage: the early statistical POS-tagging work this baseline gestures toward, which the "Where it falls short" section returns to when it contrasts this tagger with real ones.

Tagging here is deliberately tiny and dependency-free: a lexicon of closed-class function words (the, thou, and, upon…), a set that English almost never adds to, plus a handful of suffix rules (-ly→adverb, -ing/-ed→verb, -ous/-ful→adjective…), with every other word defaulting to noun. This is the classic baseline tagger that fancier models are measured against: small enough to read in one sitting, and wrong just often enough to be instructive.
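A tagger of this shape might look like the sketch below. Only the four lexicon words and the five suffix rules named above come from the text; the other lexicon entries, the tag names, the order in which the rules are tried and the function names `tag_word` and `tag` are assumptions for illustration.

```python
# Closed-class lexicon: "the", "thou", "and", "upon" are named in the text;
# the remaining entries are illustrative guesses.
LEXICON = {
    "the": "DET", "a": "DET", "an": "DET",
    "thou": "PRON", "i": "PRON",
    "and": "CONJ", "but": "CONJ",
    "upon": "PREP", "of": "PREP",
}

# Suffix rules, tried in order; the first matching suffix wins.
SUFFIX_RULES = [
    ("ly", "ADV"),
    ("ing", "VERB"),
    ("ed", "VERB"),
    ("ous", "ADJ"),
    ("ful", "ADJ"),
]

def tag_word(word):
    """Tag one word: lexicon first, then suffix rules, then default to noun."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    for suffix, role in SUFFIX_RULES:
        if w.endswith(suffix):
            return role
    return "NOUN"

def tag(text):
    """Split on whitespace and pair every word with its guessed role."""
    return [(w, tag_word(w)) for w in text.lower().split()]

print(tag("the gracious light"))
# [('the', 'DET'), ('gracious', 'ADJ'), ('light', 'NOUN')]
```

Because the suffix test is a bare `endswith`, short words such as "king" or "red" are caught by the -ing and -ed rules too, which is the kind of instructive error the paragraph above has in mind.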

State: "the" (Determiner)
Followers: world (Noun) ×19, time (Noun) ×9, very (Adverb) ×6, day (Noun) ×5, sun (Noun) ×5, other (Noun) ×5, worlds (Noun) ×4, beauty (Noun) ×3

What follows the determiner "the" in Shakespeare's sonnets, by part of speech. Almost everything is a noun — exactly the grammar a determiner demands. The chain has learned that pattern without ever being told the rule.
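A table like this one can be read straight off the tagged token stream. The sketch below uses a small hand-tagged list rather than the sonnets, so its counts are not the ones shown above; the helper names `followers` and `by_role` are hypothetical.

```python
from collections import Counter

# Hand-tagged toy input (hypothetical), standing in for the tagged corpus.
tagged = [
    ("the", "DET"), ("world", "NOUN"), ("and", "CONJ"),
    ("the", "DET"), ("time", "NOUN"), ("and", "CONJ"),
    ("the", "DET"), ("very", "ADV"), ("world", "NOUN"),
    ("and", "CONJ"), ("the", "DET"), ("world", "NOUN"),
]

def followers(tagged_tokens, state):
    """Count the states that come immediately after `state`."""
    pairs = zip(tagged_tokens, tagged_tokens[1:])
    return Counter(nxt for cur, nxt in pairs if cur == state)

def by_role(follower_counts):
    """Collapse follower counts down to totals per part of speech."""
    roles = Counter()
    for (_, role), n in follower_counts.items():
        roles[role] += n
    return roles

after_the = followers(tagged, ("the", "DET"))
print(after_the.most_common())
print(by_role(after_the))
```

In this toy input three of the four followers of "the" are nouns, the same pattern the sonnet counts show at larger scale.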

Because each state carries a grammatical role, the walk drifts along plausible role sequences — determiner → adjective → noun → verb — even when the individual word choices are nonsense. The output reads as more syntactically coherent than a plain Markov chain, at the cost of a tagging step and a much larger state space.
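The walk itself can be sketched as a weighted random choice over the followers of the current state. Everything here is an assumption for illustration: the toy tagged input, the helper names, and in particular the choice to restart from a random known state when the walk reaches a state with no recorded followers.

```python
import random
from collections import Counter, defaultdict

# Hand-tagged toy input (hypothetical).
tagged = [
    ("the", "DET"), ("fair", "ADJ"), ("world", "NOUN"), ("turns", "VERB"),
    ("the", "DET"), ("sweet", "ADJ"), ("time", "NOUN"), ("fades", "VERB"),
]

def build_chain(states):
    """Count how often each state is immediately followed by each other state."""
    chain = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        chain[current][following] += 1
    return chain

def generate(chain, length, rng):
    """Walk the chain for `length` states, restarting at random on dead ends."""
    starts = list(chain)  # only states that have at least one follower
    current = rng.choice(starts)
    walk = [current]
    while len(walk) < length:
        options = chain.get(current)
        if options:
            candidates = list(options)
            weights = [options[c] for c in candidates]
            current = rng.choices(candidates, weights=weights, k=1)[0]
        else:
            current = rng.choice(starts)  # dead end: random restart
        walk.append(current)
    return walk

chain = build_chain(tagged)
walk = generate(chain, 8, random.Random(0))
print(" ".join(word for word, _ in walk))  # the generated words
print(" ".join(role for _, role in walk))  # the role sequence they follow
```

Printing the roles on a second line makes the drift visible: the words may mix the two source lines, but the role sequence keeps cycling through determiner, adjective, noun, verb.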

Where it falls short

Still single-step memory. The role helps locally, but the chain still looks only one state back. "the → fair → ?" is decided knowing "fair" alone; the determiner that made an adjective likely is already forgotten.

The tagger is a baseline. Lexicon-plus-suffix tagging has no notion of context, so it cannot tell a word's role from its neighbours. "rose" is always a noun to it, even in "they rose"; "-ed" words are always verbs, even the adjective "learned". Real taggers use the surrounding tags to decide.
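The failure is easy to reproduce with a cut-down tagger. The sketch below is hypothetical: a two-entry lexicon, the single -ed rule and the noun default, just enough to show that the same word gets the same tag whatever surrounds it.

```python
# Hypothetical mini-tagger: two lexicon entries, one suffix rule, noun default.
LEXICON = {"the": "DET", "they": "PRON"}

def tag_word(word):
    """Tag a word from its spelling alone; neighbouring words are never consulted."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    if w.endswith("ed"):
        return "VERB"
    return "NOUN"

def tag(text):
    return [(w, tag_word(w)) for w in text.split()]

print(tag("the rose"))         # "rose" is a noun here, and is tagged NOUN
print(tag("they rose"))        # "rose" is a verb here, but is still tagged NOUN
print(tag("the learned man"))  # "learned" is an adjective here, but is tagged VERB
```

Since `tag_word` takes only the word itself as input, no amount of extra lexicon entries or suffix rules can separate the two uses of "rose"; that takes a tagger that looks at context.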

Grammar isn't meaning. A perfectly well-formed determiner→adjective→noun sequence can still be semantic nonsense ("the sluttish time"). The model respects shape, never sense.

Sparser data. Splitting every word by role multiplies the number of states, so each one is seen fewer times — more dead-ends and more reliance on random restarts than the plain chain.
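The effect shows up even on a seven-token toy input. In this sketch the hand-tagged tokens and the helper `state_stats` are assumptions for illustration; it compares the same text treated as bare words and as (word, role) pairs.

```python
from collections import Counter

# Hand-tagged toy input (hypothetical); "light" appears in two roles.
tagged = [
    ("the", "DET"), ("light", "NOUN"), ("is", "VERB"), ("fair", "ADJ"),
    ("and", "CONJ"), ("a", "DET"), ("light", "ADJ"),
]

def state_stats(states):
    """Report state count, mean observations per state, and dead-end count."""
    counts = Counter(states)
    has_follower = set(states[:-1])  # every position except the last has a successor
    dead_ends = [s for s in counts if s not in has_follower]
    return {
        "states": len(counts),
        "mean_visits": len(states) / len(counts),
        "dead_ends": len(dead_ends),
    }

plain = state_stats([word for word, _ in tagged])
split = state_stats(tagged)
print("plain:", plain)
print("split:", split)
```

As bare words the text has six states and no dead end, because the final "light" was already seen with a follower. Split by role it has seven states, each seen once, and the final adjective "light" has no recorded follower.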