A working museum of natural language processing
How we got from counting words to conversing with machines — told through 20 small programs you can run, plus concept pages for the era that outgrew the laptop.
Each stop below is a self-contained explainer. The Runnable ones come with commented code and a deep-dive README; the Concept ones cover the frontier era, defined by a scale no laptop can reach. They are ordered by idea, not strictly by year — each technique was a direct response to the limits of the one before, which is mostly but not always chronological (the Neural LM, 2003, precedes the RNN, 1990/1997, because it's the simpler idea; RAG, 2020→, follows Alignment, 2022, because it builds on an instructable model). Start anywhere, or read the full story in OVERVIEW.md. Recurring jargon has one careful home: the Glossary & References.
Before the statistics won, someone tried writing the rules by hand. Watching that approach break is the fastest way to understand why the rest of this journey counts instead.
Language as statistics: what follows what, which words matter, which documents are relevant. No labels, no learning — just counts.
Stop hand-counting; let a network learn the patterns. Embeddings, memory, and finally attention — the primitive behind everything modern.
The era of scale: from a laptop’s ~17,600 training words to a frontier model’s ~15 trillion, about a billion-fold more. These are mostly concept pages, since models at this scale cannot be trained on a laptop; together they bridge the gap from attention to the assistant reading this with you.
Four generations of generative model live in this museum, all trained on the same sonnets. Perplexity puts one number on each: at every position of a held-out text (every 5th sonnet, which the models never see in training), ask what probability the model gave the word that actually came next. A perplexity of 300 means it was, on average, as uncertain as if it were choosing among 300 equally likely words — lower is better. Entropy & the Guessing Game explains where the idea comes from.
| Model | Year | Context | Held-out perplexity | Its idea |
|---|---|---|---|---|
| Word frequency (baseline) | — | none | 721.7 | No model at all — just how common each word is. |
| Markov chain | 1948 | 1 word | 871.7 | Distinct followers, all equally likely. |
| Probability Markov chain | 1948 | 1 word | 784 | Followers weighted by how often they occurred. |
| N-gram Markov chain | 1948 | 2 words | 976.2 | Wider context, but followers still unweighted. |
| N-gram + probability | 1948 | 2 words | 892 | Wider context and weighted followers. |
| Neural language model | 2003 | 2 words | 992.5 | Learned embeddings, but only the top 200 words. |
| Character-level RNN | 1997 | whole prefix | 22,911 | Spells every word letter by letter (2.82 bits/character); no `<unk>` escape hatch. |
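The perplexity column follows the definition given before the table, and that definition fits in a few lines. What follows is a minimal sketch, not the code in scripts/perplexity.js: the function names `perplexity` and `wordPerplexityFromChars` are hypothetical, and the second function assumes a character model's cost is converted to a per-word figure by multiplying bits per character by the average characters per word.

```javascript
// Perplexity from the probabilities a model gave to the words that
// actually came next in a held-out text. Lower is better.
function perplexity(probs) {
  if (probs.length === 0) throw new Error("need at least one probability");
  let totalBits = 0;
  for (const p of probs) {
    if (!(p > 0 && p <= 1)) throw new Error(`probability out of range: ${p}`);
    totalBits += -Math.log2(p); // surprise, in bits, at this position
  }
  // Average surprise per word, turned back into an "equally likely choices" count.
  return Math.pow(2, totalBits / probs.length);
}

// Assumption: a character model's per-word perplexity is
// 2^(bits per character * average characters per word, separators included).
function wordPerplexityFromChars(bitsPerChar, charsPerWord) {
  return Math.pow(2, bitsPerChar * charsPerWord);
}

// A model that always hedges across 300 equally likely words
// should come out at a perplexity very close to 300.
const uniform = new Array(50).fill(1 / 300);
console.log(perplexity(uniform));
```

Read this way, the table's 2.82 bits/character and 22,911 are mutually consistent at roughly 5.1 characters per word, spaces included. That average is inferred here from the two published numbers, not stated by the source.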
Read it honestly: on 14,048 training words, nothing beats bare word frequency. That is not because context is useless: where a bigram's exact transition was seen in training (27% of positions), it scores 49 against the unigram's 234 at the very same spots. It is because most of what Shakespeare writes next, he has never written before: the wider the context, the rarer the exact match, and every model pays for confidence it hasn't earned. The neural model is graded on the whole stream while only speaking 200 words; the RNN must spell every rare word letter by letter (2.82 bits per character) while the word models pay one flat penalty per unknown word. The ideas were never wrong; they were starving. Feed the same next-word question a billion-fold more text and the numbers finally fall: that story continues at Scaling Laws.

Method & code: node scripts/perplexity.js (deterministic; every 5th sonnet held out, Witten-Bell smoothing, shared vocabulary; the script's header states the full rules).
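Witten-Bell smoothing is what keeps a bigram model from assigning zero probability to a transition it never saw. A minimal sketch of the interpolated form follows. It is not the museum's script: the names `trainBigram`, `unigramProb`, and `wittenBellProb` are hypothetical, and the add-one unigram with a single unknown-word slot is an assumption made here so that the example stays self-contained. The script's header remains the authority on the actual rules.

```javascript
// Count single words and "what follows what" over a token array.
function trainBigram(tokens) {
  const unigram = new Map();   // word -> count
  const followers = new Map(); // previous word -> Map(next word -> count)
  for (let i = 0; i < tokens.length; i++) {
    unigram.set(tokens[i], (unigram.get(tokens[i]) || 0) + 1);
    if (i === 0) continue;
    const prev = tokens[i - 1];
    if (!followers.has(prev)) followers.set(prev, new Map());
    const next = followers.get(prev);
    next.set(tokens[i], (next.get(tokens[i]) || 0) + 1);
  }
  return { unigram, followers, total: tokens.length };
}

// Assumption: add-one unigram over the vocabulary plus one unknown-word slot,
// so even a never-seen word gets a small nonzero probability.
function unigramProb(model, word) {
  const slots = model.unigram.size + 1;
  return ((model.unigram.get(word) || 0) + 1) / (model.total + slots);
}

// Interpolated Witten-Bell: a context with many distinct followers is treated
// as unpredictable, so more of its probability mass goes to the unigram backoff.
function wittenBellProb(model, prev, word) {
  const next = model.followers.get(prev);
  if (!next) return unigramProb(model, word); // unseen context: back off fully
  let contextCount = 0;
  for (const c of next.values()) contextCount += c;
  const distinct = next.size; // number of distinct followers seen after prev
  const seen = next.get(word) || 0;
  return (seen + distinct * unigramProb(model, word)) / (contextCount + distinct);
}

// Toy corpus (the opening of Sonnet 18), far too small to mean anything.
const tokens =
  "shall i compare thee to a summer's day thou art more lovely and more temperate".split(" ");
const model = trainBigram(tokens);
console.log(wittenBellProb(model, "more", "lovely")); // seen transition
console.log(wittenBellProb(model, "more", "day"));    // unseen, but not zero
```

The design point is the one the paragraph above makes: a seen transition scores well, an unseen one still pays a price, and that price grows as contexts get wider and exact matches get rarer.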