A working museum of natural language processing

The NLP Journey

How we got from counting words to conversing with machines — told through 20 small programs you can run, plus concept pages for the era that outgrew the laptop.

Each stop below is a self-contained explainer. The Runnable ones come with commented code and a deep-dive README; the Concept ones cover the frontier era, defined by a scale no laptop can reach. They are ordered by idea, not strictly by year — each technique was a direct response to the limits of the one before, which is mostly but not always chronological (the Neural LM, 2003, precedes the RNN, 1990/1997, because it's the simpler idea; RAG, 2020→, follows Alignment, 2022, because it builds on an instructable model). Start anywhere, or read the full story in OVERVIEW.md. Recurring jargon has one careful home: the Glossary & References.

Rules vs. Statistics

1966

Before the statistics won, someone tried writing the rules by hand. Watching that approach break is the fastest way to understand why the rest of this journey counts instead.

1966 · Runnable
ELIZA
Reflect the words back and call it understanding.

Counting & Retrieval

1910s – 1990s

Language as statistics: what follows what, which words matter, which documents are relevant. No labels, no learning — just counts.

1913 / 1948 · Runnable
Markov Chain
Generate text from what word tends to follow what.
1948 · Runnable
N-gram Markov Chain
Wider multi-word context for more coherent text.
1948 · Runnable
Probability Markov Chain
Weight each next word by how often it follows.
1948 · Runnable
N-gram + Probability
Combine wider context with weighted selection.
1971 · Runnable
POS-Tagged Markov Chain
Steer the walk with grammar, not just adjacency.
1966–70s · Runnable
HMM + Viterbi Tagger
Tag the whole sentence at once, so context can win.
1935–49 · Runnable
Zipf's Law
A few words do almost all the work, predictably.
1948 / 1951 · Runnable
Entropy & the Guessing Game
How many bits of surprise is a letter really worth?
1965 · Runnable
Edit Distance
Fewest edits between strings; a spell-checker.
1972 · Runnable
TF-IDF
Rank documents by how distinctive their words are.
1990 · Runnable
Pointwise Mutual Information
Which words co-occur more than chance? Collocations.
1990s · Runnable
Naive Bayes
Supervised text classification by word likelihoods.
early 1990s · Runnable
Co-occurrence Word Vectors
Know a word by the company it keeps.
1994 / 2016 · Runnable
Byte Pair Encoding
Build subword tokens by merging frequent pairs.
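
The first four stops in this section share one mechanism: count which word follows which, then walk those counts. As a rough illustration (not the museum's own code), the sketch below builds the follower counts and picks each next word in proportion to how often it followed. The function names, the `random` parameter, and the sample sentence are all hypothetical.

```javascript
// Count, for every word, how often each other word follows it.
function buildChain(tokens) {
  const chain = new Map(); // word -> Map(follower -> count)
  for (let i = 0; i + 1 < tokens.length; i++) {
    const row = chain.get(tokens[i]) || new Map();
    row.set(tokens[i + 1], (row.get(tokens[i + 1]) || 0) + 1);
    chain.set(tokens[i], row);
  }
  return chain;
}

// Pick a follower with probability proportional to its count.
// `random` is injectable so the choice can be made deterministic.
function pickWeighted(row, random = Math.random) {
  let total = 0;
  for (const count of row.values()) total += count;
  let threshold = random() * total;
  for (const [word, count] of row) {
    threshold -= count;
    if (threshold < 0) return word;
  }
  return [...row.keys()].pop(); // guard against floating-point edge cases
}

// Walk the chain from `start` until `length` words or a dead end.
function generate(chain, start, length, random = Math.random) {
  const out = [start];
  let current = start;
  while (out.length < length && chain.has(current)) {
    current = pickWeighted(chain.get(current), random);
    out.push(current);
  }
  return out.join(" ");
}

// Hypothetical toy corpus, not the museum's sonnets.
const sample = "the rose is red the rose is sweet the night is dark".split(" ");
const chain = buildChain(sample);
console.log(generate(chain, "the", 6));
```

Dropping the counts and choosing uniformly among distinct followers would give the unweighted Markov chain; keying the map on the previous two words instead of one would give the N-gram variant.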

Learning Representations

2003 – 2017

Stop hand-counting; let a network learn the patterns. Embeddings, memory, and finally attention — the primitive behind everything modern.

2003 · Runnable
Neural Language Model
Stop counting words; learn what they mean.
2013 · Runnable
Word2Vec
Stop counting context. Predict it, and keep the weights.
1990 / 1997 · Runnable
Recurrent Neural Network
A hidden state that remembers as it reads.
2014 · Concept
seq2seq & the Bottleneck
Squeeze a sentence into one vector, then hit its limit.
2014–17 · Runnable
Attention
Let every token look directly at every other.
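
To make "every token looks at every other" concrete, here is a minimal sketch of scaled dot-product attention, assuming plain arrays of numbers and no learned projections, masking, or multiple heads. The function names are hypothetical, and the museum's Attention stop may be organised differently.

```javascript
// Turn a list of scores into weights that are positive and sum to 1.
function softmax(xs) {
  const max = Math.max(...xs); // subtract the max for numerical stability
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// queries, keys, values: one vector per token; queries and keys share a length.
function attention(queries, keys, values) {
  const scale = Math.sqrt(keys[0].length);
  return queries.map((q) => {
    // Score this query against every key (dot product, scaled).
    const scores = keys.map((k) => k.reduce((s, kj, j) => s + kj * q[j], 0) / scale);
    const weights = softmax(scores);
    // Output is the weighted average of all value vectors.
    return values[0].map((_, d) => weights.reduce((s, w, t) => s + w * values[t][d], 0));
  });
}

// Hypothetical 3-token input used as queries, keys, and values at once.
const tokens = [[1, 0], [0, 1], [1, 1]];
console.log(attention(tokens, tokens, tokens));
```

Each output row mixes all value vectors, weighted by how well that token's query matches every key, so no information has to pass through a single fixed-size vector.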

The Frontier

2017 → today

The era of scale: from a laptop’s ~17,600 training words to a frontier model’s ~15 trillion, about a billion-fold more. These stops are mostly concept pages, since the artifacts cannot be trained on a laptop; together they bridge the gap from attention to the assistant reading this with you.

2017 · Concept
The Transformer
The full architecture built from attention.
2018 · Concept
Pretraining & Transfer Learning
Train once on raw text, then adapt to anything.
2020 · Concept
Scaling Laws & In-Context Learning
Capability becomes a predictable function of scale.
2022 · Concept
Alignment
Instruction tuning and RLHF: do what you ask.
2020→ · Runnable
Retrieval-Augmented Generation
Don't memorise the world; look it up.
2024–25 · Concept
Reasoning & Test-Time Compute
Think longer before answering.
2023→ · Concept
Tool Use & Agents
From what a model can say to what it can do.

The scoreboard

one corpus, one question

Four generations of generative model live in this museum, all trained on the same sonnets. Perplexity puts one number on each: at every position of a held-out text (every 5th sonnet, which the models never see in training), ask what probability the model gave the word that actually came next. A perplexity of 300 means it was, on average, as uncertain as if it were choosing among 300 equally likely words — lower is better. Entropy & the Guessing Game explains where the idea comes from.
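
The definition above can be sketched in a few lines: perplexity is the exponential of the average negative log-probability the model gave to each word that actually came next. This is an illustration rather than the code in scripts/perplexity.js; the function name and the example input are hypothetical.

```javascript
// probabilities[i] is the probability the model assigned to the word
// that actually appeared at held-out position i.
function perplexity(probabilities) {
  if (probabilities.length === 0) throw new Error("need at least one position");
  let logSum = 0;
  for (const p of probabilities) {
    if (!(p > 0 && p <= 1)) throw new Error("each probability must be in (0, 1]");
    logSum += Math.log(p);
  }
  // exp(-mean log p) is the inverse geometric mean of the probabilities.
  return Math.exp(-logSum / probabilities.length);
}

// Hypothetical example: ten positions, each given probability 1/300,
// i.e. a model choosing among 300 equally likely words every time.
const uniform = new Array(10).fill(1 / 300);
console.log(perplexity(uniform)); // approximately 300
```

Because the average is taken over log-probabilities, a single near-zero probability drags the score up sharply, which is why smoothing for unseen words matters so much on a small corpus.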

Model | Year | Context | Held-out perplexity | Its idea
Word frequency (baseline) | - | none | 721.7 | No model at all: just how common each word is.
Markov chain | 1948 | 1 word | 871.7 | Distinct followers, all equally likely.
Probability Markov chain | 1948 | 1 word | 784 | Followers weighted by how often they occurred.
N-gram Markov chain | 1948 | 2 words | 976.2 | Wider context, but followers still unweighted.
N-gram + probability | 1948 | 2 words | 892 | Wider context and weighted followers.
Neural language model | 2003 | 2 words | 992.5 | Learned embeddings, but only the top 200 words.
Character-level RNN | 1997 | whole prefix | 22,911 | Spells every word letter by letter (2.82 bits/character); no <unk> escape hatch.

Read it honestly: on 14,048 training words, nothing beats bare word frequency. That is not because context is useless: where a bigram's exact transition was seen in training (27% of positions), it scores 49 against the unigram's 234 at the very same spots. It is because most of what Shakespeare writes next, he has never written before: the wider the context, the rarer the exact match, and every model pays for confidence it hasn't earned. The neural model is graded on the whole stream while only speaking 200 words; the RNN must spell every rare word letter by letter (2.82 bits per character) while the word models pay one flat penalty per unknown word. The ideas were never wrong; they were starving. Feed the same next-word question a billion-fold more text and the numbers finally fall: that story continues at Scaling Laws.

Method & code: node scripts/perplexity.js (deterministic; every 5th sonnet held out, Witten-Bell smoothing, shared vocabulary; the script's header states the full rules).
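
The method note names Witten-Bell smoothing, which is what keeps an unseen transition from scoring zero. As a rough sketch of the textbook interpolated form for a bigram model: a context that has been followed by many distinct words gives more of its probability mass to a lower-order fallback. Whether scripts/perplexity.js uses this exact variant is an assumption; the function names and the uniform fallback are hypothetical.

```javascript
// Count followers for each previous word: prev -> Map(next -> count).
function trainBigrams(tokens) {
  const followers = new Map();
  for (let i = 0; i + 1 < tokens.length; i++) {
    const prev = tokens[i];
    const next = tokens[i + 1];
    if (!followers.has(prev)) followers.set(prev, new Map());
    const row = followers.get(prev);
    row.set(next, (row.get(next) || 0) + 1);
  }
  return followers;
}

// Interpolated Witten-Bell estimate:
//   P(next | prev) = (count(prev, next) + T * P_lower(next)) / (total + T)
// where total is how often prev was followed by anything and
// T is the number of distinct words seen after prev.
function wittenBell(followers, prev, next, lowerOrder) {
  const row = followers.get(prev);
  if (!row) return lowerOrder(next); // context never seen: fall back entirely
  let total = 0;
  for (const c of row.values()) total += c;
  const types = row.size;
  const count = row.get(next) || 0;
  return (count + types * lowerOrder(next)) / (total + types);
}

// Hypothetical toy data: a 4-word vocabulary {a, b, c, d} with a uniform fallback.
const followers = trainBigrams("a b a c a b".split(" "));
const uniformLower = () => 1 / 4;
console.log(wittenBell(followers, "a", "b", uniformLower)); // 0.5 (seen twice)
console.log(wittenBell(followers, "a", "d", uniformLower)); // 0.1 (never seen)
```

In the toy data, "a" was followed three times by two distinct words, so two fifths of its probability mass goes to the fallback. That reserved mass is the price a wider-context model pays at every position where its exact match was never seen.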