1948 · Sequence Modelling
Weight each transition by how often it actually occurs in the source text.
The simple Markov chain picks the next word uniformly at random from all words that have ever followed the current one. But a word that appears as a follower 50 times in the corpus is clearly a stronger candidate than one that appears once. The probability-based model uses this information.
Like every page in the Markov family, this one is a variation on Claude Shannon's 1948 n-gram idea (see Markov Chain), not a later invention. The five pages are ordered by how much each adds on top of the last, not by the year each variant was first written down.
After building the transition table, each follower's count is divided by the total count of all followers for that key. This converts raw frequencies into a probability distribution that sums to 1 for each key. Generation then uses a weighted random draw: roll a number between 0 and 1, then walk through the followers, accumulating their probabilities, until the running total exceeds the roll. Common transitions are proportionally more likely to be taken.
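The steps above can be sketched in Python as follows. This is a minimal sketch, not the page's own implementation: the function names `build_probabilities` and `weighted_next`, the whitespace tokenisation, and the toy corpus are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def build_probabilities(words):
    """Count each (word, follower) pair, then normalise the counts per key."""
    counts = defaultdict(Counter)
    for current, follower in zip(words, words[1:]):
        counts[current][follower] += 1

    table = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        # Dividing by the total turns raw counts into probabilities.
        table[current] = {w: c / total for w, c in followers.items()}
    return table


def weighted_next(table, current, rng=random):
    """Roll a number in [0, 1) and walk the cumulative probability.

    Raises KeyError if `current` has no recorded followers.
    """
    roll = rng.random()
    cumulative = 0.0
    chosen = None
    for word, prob in table[current].items():
        chosen = word
        cumulative += prob
        if roll < cumulative:
            break
    # If rounding leaves the total just below 1, the last follower is used.
    return chosen


# Hypothetical toy corpus, standing in for the sonnets.
corpus = "the cat sat on the mat and the cat ran".split()
table = build_probabilities(corpus)
print(table["the"])
print(weighted_next(table, "the"))
```

In this toy corpus "cat" follows "the" twice and "mat" once, so "cat" should be drawn roughly twice as often. Python's standard `random.choices(population, weights=...)` performs the same kind of weighted draw; the explicit loop is shown only to mirror the cumulative walk described above.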
Words following "the" in the sonnets, by probability
Each bar is the probability of that word being chosen as the next word after "the". "world's" wins the weighted draw most often, "sonnets" rarely. All followers still have a non-zero chance.
The generated text feels more faithful to the source because common phrasings recur proportionally. The model is not more intelligent — it still has no context, no memory, no semantics — but it has learned to prefer the probable over the merely possible.
Still no context. Probabilities are computed per single word, not per phrase. "the" → "world's" has a 7.7% probability regardless of everything that came before "the". The sentence under construction is invisible.
Frequency is not meaning. A rare but semantically correct word pair will usually lose the weighted draw to a common but incoherent one, because the draw favours whatever was frequent in the corpus. Probability faithfully reflects what was in the corpus, not what should follow in a given context.
Zero probability for unseen pairs. If two words never appeared adjacent in the training text, their transition probability is exactly zero — they can never be generated together, no matter how natural the combination might be.
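A small sketch of this limitation, assuming a hypothetical toy corpus and a helper name (`transition_probability`) invented for illustration: a pair that never occurs adjacently has no count, so its estimated probability is zero.

```python
from collections import Counter, defaultdict


def transition_probability(words, current, follower):
    """Estimate P(follower | current) from adjacent pairs in `words`."""
    counts = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1

    total = sum(counts[current].values())
    if total == 0:
        return 0.0  # `current` never appears with a follower
    # Counter returns 0 for a follower that was never seen after `current`.
    return counts[current][follower] / total


# Hypothetical toy corpus, standing in for the training text.
corpus = "the cat sat on the mat".split()
seen = transition_probability(corpus, "the", "cat")
unseen = transition_probability(corpus, "the", "dog")
print(seen, unseen)
```

Here "the" → "cat" comes out at 0.5, while "the" → "dog" is exactly 0.0, even though "the dog" is a perfectly natural phrase.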
Still only generation. Like its predecessors, this model can produce text but cannot make any statement about what text means, which document is relevant to a query, or how two words relate semantically.