HMM + Viterbi Tagger — NLP Journey

pos-markov/'s baseline tagger looks at one word at a time: "rose" is tagged the same way everywhere, because the tagger is a pure function of a word's spelling. A Hidden Markov Model, decoded with the Viterbi algorithm, tags a whole sentence at once, so an ambiguous word gets resolved by its neighbours rather than by its spelling alone.

Two tables drive this: transition probabilities (how likely a Verb is to follow a Pronoun) and emission probabilities (how likely a specific word is, given the tag). The tags are "hidden" (you see the words, not the tags), so tagging means finding the tag sequence most likely to have produced the sentence. Trying every sequence is exponential in sentence length; Viterbi finds the best one in time linear in sentence length (and quadratic in the number of tags) with the same dynamic-programming trick as edit-distance/'s matrix: build the best score reaching each (word, tag) cell from the best scores one position back, keep a backpointer, and read the winning path off the end.
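The recurrence described above can be sketched in a few lines of Python. This is a minimal illustration, not this repo's implementation: the function name `viterbi`, the three-tag toy model, and every probability in it are made up for the example.

```python
import math

def viterbi(words, tags, log_start, log_trans, log_emit):
    """Return (best tag sequence, its log-probability) for `words`.

    log_start[t]    : log P(t is the first tag)
    log_trans[p][t] : log P(t | previous tag p)
    log_emit[t][w]  : log P(w | t)
    """
    # score[i][t] = best log-probability of any path ending in tag t at word i
    score = [{t: log_start[t] + log_emit[t][words[0]] for t in tags}]
    back = [{}]  # back[i][t] = previous tag on that best path
    for i in range(1, len(words)):
        score.append({})
        back.append({})
        for t in tags:
            # Best predecessor: look only one position back.
            prev = max(tags, key=lambda p: score[i - 1][p] + log_trans[p][t])
            score[i][t] = (score[i - 1][prev] + log_trans[prev][t]
                           + log_emit[t][words[i]])
            back[i][t] = prev
    # Read the winning path off the end by following backpointers.
    last = max(tags, key=lambda t: score[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, score[-1][last]

def logs(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in probs.items()}

# Toy model with invented numbers (NOT the tables trained in this repo).
tags = ["Pronoun", "Noun", "Verb"]
log_start = logs({"Pronoun": 0.6, "Noun": 0.3, "Verb": 0.1})
log_trans = {
    "Pronoun": logs({"Pronoun": 0.1, "Noun": 0.1, "Verb": 0.8}),
    "Noun":    logs({"Pronoun": 0.1, "Noun": 0.3, "Verb": 0.6}),
    "Verb":    logs({"Pronoun": 0.3, "Noun": 0.5, "Verb": 0.2}),
}
log_emit = {
    "Pronoun": logs({"they": 0.9, "rose": 0.1}),
    "Noun":    logs({"they": 0.1, "rose": 0.9}),
    "Verb":    logs({"they": 0.1, "rose": 0.9}),
}

path, log_prob = viterbi(["they", "rose"], tags, log_start, log_trans, log_emit)
print(path)  # ['Pronoun', 'Verb']
```

In this toy model "rose" is equally likely under Noun and Verb, so the transition from Pronoun is the only thing that separates them. The two nested loops over tags inside the loop over words are where the cost comes from: linear in words, quadratic in tags.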

Honest training data: this repo has no hand-annotated corpus. Transition counts come from auto-tagging the real sonnets with the baseline tagger — noisy per-word, but the sequence patterns of English survive. Emission counts add a handful of hand-written ambiguity seeds for classic ambiguous words ("rose", "light", "still") — without them the model could never learn "rose" is anything but a flower, since every real occurrence in the sonnets is exactly that.
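As a rough sketch of how such counts might be gathered, assuming sentences arrive as lists of (word, tag) pairs already labelled by the baseline tagger: the function name `count_tables`, the `"<s>"` start marker, and the seed values below are hypothetical, not taken from this repo.

```python
from collections import Counter

def count_tables(tagged_sentences, seeds):
    """Count tag-to-tag transitions and tag-to-word emissions.

    tagged_sentences : iterable of [(word, tag), ...] from an auto-tagger
    seeds            : {(tag, word): extra_count} hand-written emission seeds
    """
    trans = Counter()  # (previous tag, tag) -> count
    emit = Counter()   # (tag, word) -> count
    for sentence in tagged_sentences:
        prev = "<s>"   # assumed sentence-start marker
        for word, tag in sentence:
            trans[(prev, tag)] += 1
            emit[(tag, word.lower())] += 1
            prev = tag
    # Seeds only touch emissions; transitions stay purely corpus-derived.
    for (tag, word), extra in seeds.items():
        emit[(tag, word)] += extra
    return trans, emit

# Invented two-sentence "corpus" and one invented seed.
auto_tagged = [
    [("the", "Determiner"), ("rose", "Noun")],
    [("they", "Pronoun"), ("sing", "Verb")],
]
seeds = {("Verb", "rose"): 3}

trans, emit = count_tables(auto_tagged, seeds)
print(emit[("Noun", "rose")], emit[("Verb", "rose")])  # 1 3
```

Without the seed, `emit[("Verb", "rose")]` would stay at zero, which is the gap the hand-written counts are there to fill.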

tag ↓ / word →	they	rose
Noun	−11.55	−15.07
Verb	−10.83	−14.74
Adjective	−10.42	−19.38
Adverb	−10.61	−17.64
Pronoun	−6.77	−18.31
Determiner	−10.28	−17.85
Preposition	−10.87	−17.52
Conjunction	−10.68	−18.16

The real trellis for "they rose": every cell is the best-path log-probability of reaching that (word, tag) pair. The winning path, highlighted in orange, is Pronoun → Verb. The baseline tagger says "rose" is a Noun here, unconditionally. Viterbi says Verb, because the Verb cell in the "rose" column (−14.74) beats the Noun cell (−15.07), a margin built entirely from context, since both tags start from the same word.
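The numbers in the trellis can be checked directly. The short Python sketch below copies the "rose" column and the Pronoun cell for "they" from the table; the variable names are illustrative, and the step-cost line assumes the Verb cell was reached from Pronoun, as the winning path states.

```python
# "rose" column of the trellis, copied from the table above.
rose_column = {
    "Noun": -15.07, "Verb": -14.74, "Adjective": -19.38, "Adverb": -17.64,
    "Pronoun": -18.31, "Determiner": -17.85, "Preposition": -17.52,
    "Conjunction": -18.16,
}
they_pronoun = -6.77  # best cell in the "they" column

# The path ends at the highest-scoring cell of the last column.
end_tag = max(rose_column, key=rose_column.get)

# How far Verb is ahead of Noun, in log-probability units.
margin = round(rose_column["Verb"] - rose_column["Noun"], 2)

# Log-probability added by the single Pronoun -> Verb step
# (transition plus emission of "rose"), assuming that backpointer.
step_cost = round(rose_column[end_tag] - they_pronoun, 2)

print(end_tag, margin, step_cost)  # Verb 0.33 -7.97
```

A margin of 0.33 in log space is small: a modest change to either table could flip the decision.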

Where it falls short

The training data is synthetic where it matters most. The transition table is real; the disambiguating emission counts for "rose", "light", and "still" are hand-picked numbers, not observed frequencies. A tagger trained this way "knows" exactly the ambiguities its author thought to seed, and not one more.

Only one word of tag history. This is a bigram HMM — the next tag depends only on the immediately preceding tag, the same context-window limit every bigram Markov chain here has.

It still doesn't know what words mean. "Rose" resolves to Verb or Noun by pattern statistics over tags, not by understanding that one is a flower and the other a motion. Two unrelated ambiguous words with identical tag patterns would be resolved identically.

Smoothing is uniform and crude. Add-1 smoothing treats every unseen combination as equally unlikely, which isn't true. Modern taggers use better smoothing, or skip HMMs for neural sequence models that learn a continuous representation instead of a lookup table.
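To make the "uniform and crude" point concrete, here is a minimal sketch of add-1 (Laplace) smoothing over a count table keyed by (context, outcome). The function name `add_one_log_prob`, the counts, and the vocabulary size are invented for illustration and are not this repo's values.

```python
import math
from collections import Counter

def add_one_log_prob(counts, context, outcome, num_outcomes):
    """log P(outcome | context) with add-1 smoothing.

    counts       : {(context, outcome): count}
    num_outcomes : number of distinct outcomes (e.g. vocabulary size)
    """
    context_total = sum(n for (c, _), n in counts.items() if c == context)
    seen = counts.get((context, outcome), 0)
    return math.log((seen + 1) / (context_total + num_outcomes))

# Invented emission counts for the Noun tag, with an assumed 5-word vocabulary.
emit = Counter({("Noun", "rose"): 4, ("Noun", "light"): 1})
vocab_size = 5

seen = add_one_log_prob(emit, "Noun", "rose", vocab_size)      # log(5/10)
unseen_a = add_one_log_prob(emit, "Noun", "still", vocab_size)  # log(1/10)
unseen_b = add_one_log_prob(emit, "Noun", "they", vocab_size)   # log(1/10)

print(unseen_a == unseen_b)  # True
```

Both unseen words get exactly the same probability under Noun, even though one might be a far more plausible noun than the other. That is the uniformity the paragraph above objects to.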
