1948 · Sequence Modelling
The statistical ceiling of sequence modelling: context and frequency, combined.
This model applies probability weighting to n-gram context keys, combining the two refinements made to the basic Markov chain. For each sequence of N−1 words seen in the corpus, it records every word that followed it and how many times — then converts those counts into a probability distribution. Generation uses a weighted draw over that distribution, conditioned on the current context window.
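The counting and normalising steps described above can be sketched in a few lines. This is a minimal illustration, not the page's own implementation: the function name build_table, the toy corpus, and the choice of plain dictionaries are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_table(words, n):
    """Map each (n-1)-word context to a {next_word: probability} dict."""
    counts = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])   # the N-1 words forming the key
        follower = words[i + n - 1]           # the word that came next
        counts[context][follower] += 1

    # Convert raw counts into a probability distribution per context.
    table = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        table[context] = {w: c / total for w, c in followers.items()}
    return table

# Toy corpus (hypothetical, not the sonnets), chosen so one key has two followers.
words = "the cat sat on the mat and the cat ran".split()
table = build_table(words, n=3)
print(table[("the", "cat")])   # two followers, seen once each
print(table[("on", "the")])    # a single follower
```

In this toy corpus the key ("the", "cat") is followed once by "sat" and once by "ran", so each gets probability 0.5, while ("on", "the") has only one follower and its distribution is a certainty.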
This, like every page in the Markov family, is a variation on Claude Shannon's 1948 n-gram idea (see Markov Chain), not a later invention — the five pages are ordered by how much each adds on top of the last, not by the year each variant happened to be written down.
The context window slides forward one word at a time. At each step, the last N−1 generated words form the key, and the next word is sampled from the probability distribution stored for that key.
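The sliding window can be sketched as a short loop. The table below is hand-written for illustration, and the function name generate and the decision to stop at an unseen key are assumptions of this sketch rather than details taken from the page.

```python
import random

def generate(table, seed, length, rng):
    """Extend `seed` one word at a time, keyed on the last len(seed) words."""
    k = len(seed)                      # context size, i.e. N-1
    out = list(seed)
    while len(out) < length:
        key = tuple(out[-k:])          # the window slides forward one word per step
        dist = table.get(key)
        if dist is None:               # key never seen: this sketch simply stops
            break
        candidates, weights = zip(*dist.items())
        # Weighted draw over the stored distribution for this key.
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return out

# Hypothetical hand-written table for context size 2.
table = {
    ("the", "cat"): {"sat": 0.5, "ran": 0.5},
    ("cat", "sat"): {"on": 1.0},
    ("sat", "on"): {"the": 1.0},
    ("on", "the"): {"mat": 1.0},
}
rng = random.Random(0)                 # seeded only to make runs repeatable
result = generate(table, ("the", "cat"), 6, rng)
print(" ".join(result))
```

Only the first step involves a real choice here. Every later key has a single follower, so once the first word is drawn the rest of the output is fixed.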
Context window (size 2) sliding through generation
Key: "fairest creatures" → predicted next word: "we" (probability 1)
"fairest creatures" appears only once in the sonnets — followed only by "we". Probability collapses to certainty.
Small context (n=2)
Many keys seen, rich distributions. Output is varied but can feel random.
Large context (n=5)
Most keys seen only once. Output is locally perfect — reproducing Shakespeare verbatim.
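This tradeoff is irreducible within the statistical paradigm. On a fixed corpus, the words that follow a longer key are always a subset of the words that follow its shorter suffix, so adding context can only narrow the choices available at each step, never widen them.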
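One rough way to see the tradeoff is to count how many distinct next words each key offers as the context grows. The sketch below uses a hypothetical toy corpus, and the measure itself (mean distinct followers per key, in a function named mean_followers) is an assumption chosen for simplicity, not a metric the page defines.

```python
from collections import defaultdict

def mean_followers(words, n):
    """Average number of distinct next words per (n-1)-word context."""
    followers = defaultdict(set)
    for i in range(len(words) - n + 1):
        followers[tuple(words[i:i + n - 1])].add(words[i + n - 1])
    return sum(len(s) for s in followers.values()) / len(followers)

# Toy corpus (hypothetical) with some deliberate repetition.
words = ("the cat sat on the mat the cat ran on the mat "
         "the dog sat on the rug").split()

for n in (2, 3, 4):
    print(n, round(mean_followers(words, n), 2))
```

On this corpus the average falls at each step from n = 2 to n = 4, approaching 1, the point at which every key has exactly one continuation and generation becomes replay.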
This is the most faithful statistical approximation of the training text's surface patterns. It represents the practical ceiling of what pure sequence statistics can do without introducing learned representations or semantic knowledge.
Wider context and probability weighting together: more coherent than either alone, but the larger the context the more it just echoes the source.
Hits the ceiling. Beyond a context size of 3 or 4 on a small corpus, most keys have a single continuation and the model simply reproduces the source. There is no path from here to generalisation: within this paradigm, the only way to improve output quality is to add more training data.
Cannot handle unseen contexts. If the N−1 words generated so far have never appeared as a sequence in the training corpus, the model has no next-word distribution at all and must restart randomly. This happens constantly with large N.
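The restart behaviour can be sketched as a small lookup helper. The name resolve_key and the policy of picking a uniformly random known key are assumptions of this example; the page says only that the model "must restart randomly".

```python
import random

def resolve_key(table, context, rng):
    """Return (key, restarted): the context itself if seen, else a random known key."""
    key = tuple(context)
    if key in table:
        return key, False
    # Unseen context: there is no distribution to sample from,
    # so fall back to any key that does exist in the table.
    return rng.choice(sorted(table)), True

# Hypothetical hand-written table for context size 2.
table = {
    ("the", "cat"): {"sat": 0.5, "ran": 0.5},
    ("cat", "sat"): {"on": 1.0},
}
rng = random.Random(0)

seen_key, seen_restarted = resolve_key(table, ["the", "cat"], rng)
new_key, new_restarted = resolve_key(table, ["cat", "ran"], rng)
print(seen_key, seen_restarted)
print(new_key, new_restarted)
```

A restart breaks continuity: the words generated after it have no statistical connection to the words before it.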
Word identity is opaque. The model has no idea that "beauty" and "fairness" are related, or that "winter" and "summer" are antonyms. Each word is a string token and nothing more. Substituting a synonym produces a completely different — and potentially empty — context key.
Cannot understand, only imitate. The model produces output that mimics the surface statistics of Shakespeare, but it has learned nothing about what the sonnets mean, who they address, or what themes they share. It cannot answer any question about the text.