The NLP Journey

Each stop below is a self-contained explainer. The Runnable ones come with commented code and a deep-dive README; the Concept ones cover the frontier era, defined by a scale no laptop can reach. They are ordered by idea, not strictly by year — each technique was a direct response to the limits of the one before, which is mostly but not always chronological (the Neural LM, 2003, precedes the RNN, 1990/1997, because it's the simpler idea; RAG, 2020→, follows Alignment, 2022, because it builds on an instructable model). Start anywhere, or read the full story in OVERVIEW.md. Recurring jargon has one careful home: the Glossary & References.

The scoreboard

one corpus, one question

Four generations of generative model live in this museum, all trained on the same sonnets. Perplexity puts one number on each: at every position of a held-out text (every 5th sonnet, which the models never see in training), ask what probability the model gave the word that actually came next. A perplexity of 300 means it was, on average, as uncertain as if it were choosing among 300 equally likely words — lower is better. Entropy & the Guessing Game explains where the idea comes from.
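The definition above fits in a few lines. This is a minimal sketch, not the code in scripts/perplexity.js: `perplexity` and `splitEveryFifth` are hypothetical names, `probs` is assumed to hold the probability each model gave to the word that actually came next, and which of every five sonnets is held out is also an assumption.

```javascript
// Perplexity = exp of the average negative log-probability of the true next word.
// `probs` (hypothetical input): one probability per held-out position, each in (0, 1].
function perplexity(probs) {
  if (probs.length === 0) throw new Error("need at least one position");
  let totalLog = 0;
  for (const p of probs) {
    if (!(p > 0 && p <= 1)) throw new Error(`invalid probability: ${p}`);
    totalLog += Math.log(p);
  }
  return Math.exp(-totalLog / probs.length);
}

// Hold out every 5th item. Assumption: the 5th, 10th, 15th, ... items are held out;
// the real script may start from a different offset.
function splitEveryFifth(items) {
  const train = [];
  const heldOut = [];
  items.forEach((item, i) => ((i + 1) % 5 === 0 ? heldOut : train).push(item));
  return { train, heldOut };
}

// A model that always spreads its bet over 300 equally likely words
// scores a perplexity of (approximately, in floating point) 300.
const uniformGuesses = new Array(1000).fill(1 / 300);
console.log(perplexity(uniformGuesses));
```

Averaging in log space is the design choice that matters: multiplying thousands of small probabilities directly would underflow to zero long before the end of a held-out text.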

Model	Year	Context	Held-out perplexity	Its idea
Word frequency (baseline)	—	none	721.7	No model at all — just how common each word is.
Markov chain	1948	1 word	871.7	Distinct followers, all equally likely.
Probability Markov chain	1948	1 word	784	Followers weighted by how often they occurred.
N-gram Markov chain	1948	2 words	976.2	Wider context, but followers still unweighted.
N-gram + probability	1948	2 words	892	Wider context and weighted followers.
Neural language model	2003	2 words	992.5	Learned embeddings, but only the top 200 words.
Character-level RNN	1997	whole prefix	22,911	Spells every word letter by letter (2.82 bits/character); no <unk> escape hatch.
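The difference between the first two Markov rows ("all equally likely" versus "weighted by how often they occurred") can be sketched on a toy token list. The function names and the six-word example are made up for illustration; the museum's own implementations may be organized differently.

```javascript
// Count which words follow each word in a token list.
function buildFollowers(tokens) {
  const followers = new Map(); // word -> Map(next word -> count)
  for (let i = 0; i + 1 < tokens.length; i++) {
    const cur = tokens[i];
    const next = tokens[i + 1];
    if (!followers.has(cur)) followers.set(cur, new Map());
    const counts = followers.get(cur);
    counts.set(next, (counts.get(next) || 0) + 1);
  }
  return followers;
}

// Plain Markov chain: every distinct follower is equally likely.
function unweightedProb(followers, cur, next) {
  const counts = followers.get(cur);
  if (!counts || !counts.has(next)) return 0;
  return 1 / counts.size;
}

// Probability Markov chain: followers weighted by how often they occurred.
function weightedProb(followers, cur, next) {
  const counts = followers.get(cur);
  if (!counts || !counts.has(next)) return 0;
  let total = 0;
  for (const c of counts.values()) total += c;
  return counts.get(next) / total;
}

// Toy data (not from the sonnets): "the" is followed by "rose" twice and "thorn" once.
const toyTokens = "the rose the rose the thorn".split(" ");
const toyFollowers = buildFollowers(toyTokens);
console.log(unweightedProb(toyFollowers, "the", "rose")); // 0.5 (1 of 2 distinct followers)
console.log(weightedProb(toyFollowers, "the", "rose"));   // 2/3 (2 of 3 occurrences)
```

Both functions return 0 for a transition never seen in training, which would make perplexity infinite on held-out text. That is why a fair comparison needs smoothing of some kind.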

Read it honestly: on 14,048 training words, nothing beats bare word frequency. That is not because context is useless: where a bigram's exact transition was seen in training (27% of positions), it scores 49 against the unigram's 234 at the very same spots. It is because most of what Shakespeare writes next, he has never written before. The wider the context, the rarer the exact match, and every model pays for confidence it hasn't earned. The neural model is graded on the whole stream while its vocabulary holds only the top 200 words; the RNN must spell every rare word letter by letter (2.82 bits per character), while the word models pay one flat penalty per unknown word. The ideas were never wrong; they were starving. Feed the same next-word question a billion-fold more text and the numbers finally fall: that story continues at Scaling Laws.

Method & code: node scripts/perplexity.js (deterministic; every 5th sonnet held out, Witten-Bell smoothing, shared vocabulary; the script's header states the full rules).
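For readers unfamiliar with Witten-Bell smoothing, here is a sketch of its textbook interpolated form for a bigram model. It is an assumption that the script uses this variant; its header is the authority. The idea is that a context seen with many distinct followers is likely to produce yet another new one, so it hands more probability mass back to the unigram distribution. `trainWittenBell` and its return shape are hypothetical names.

```javascript
// Interpolated Witten-Bell for bigrams (textbook form, not necessarily the script's):
//   P(w | prev) = (c(prev, w) + T(prev) * P_uni(w)) / (c(prev) + T(prev))
// where c(prev) is how often prev was followed by anything and
// T(prev) is the number of distinct words seen after prev.
function trainWittenBell(tokens) {
  const unigram = new Map(); // word -> count
  const bigram = new Map();  // prev -> Map(word -> count)
  for (let i = 0; i < tokens.length; i++) {
    const w = tokens[i];
    unigram.set(w, (unigram.get(w) || 0) + 1);
    if (i > 0) {
      const prev = tokens[i - 1];
      if (!bigram.has(prev)) bigram.set(prev, new Map());
      const row = bigram.get(prev);
      row.set(w, (row.get(w) || 0) + 1);
    }
  }
  const total = tokens.length;

  // Unsmoothed unigram: words outside the training vocabulary get 0 here.
  // A full setup would map them to an unknown-word token first.
  const pUnigram = (w) => (unigram.get(w) || 0) / total;

  function pBigram(prev, w) {
    const row = bigram.get(prev);
    if (!row) return pUnigram(w); // context never seen: fall back entirely
    let contextCount = 0;
    for (const c of row.values()) contextCount += c;
    const distinct = row.size;
    const seen = row.get(w) || 0;
    return (seen + distinct * pUnigram(w)) / (contextCount + distinct);
  }

  return { pUnigram, pBigram };
}

// Toy data (not from the sonnets): "the the" never occurs, yet gets nonzero probability.
const wbModel = trainWittenBell("the rose the rose the thorn".split(" "));
console.log(wbModel.pBigram("the", "rose"));
console.log(wbModel.pBigram("the", "the"));
```

Because the weights are set by counts alone, the estimate needs no tuned hyperparameter, which suits a deterministic scoreboard.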

Rules vs. Statistics

Counting & Retrieval

Learning Representations

The Frontier
