The NLP Journey — reference

Glossary & References

The recurring words of the journey, defined in plain language — and the papers behind every stop.

Every explainer in this museum defines its terms as it goes, so no page requires this one. But some words recur from stop to stop — token, embedding, parameter — and it helps to have them all in one place, said once, carefully. Two sentences each; no equations.

Terms

attention: A mechanism that lets every token in a sequence look directly at every other token and decide, with a learned weight, how much each one matters for the job at hand. It replaced the idea of squeezing a whole sentence through a single memory, and it is the core primitive of the Transformer. See it run: Attention.
backpropagation: The bookkeeping that makes neural networks trainable: after the network makes a prediction, the error is passed backwards through every layer to work out how much each parameter contributed to it. Those per-parameter blame assignments are the gradients that gradient descent then follows.
bag of words: Treating a text as an unordered pile of words — keeping what occurs and how often, discarding the order entirely. It sounds like vandalism, but for tasks like search and classification (TF-IDF, Naive Bayes) the pile alone carries a surprising amount of signal.
bit: The unit of information: one bit is the answer to one perfectly balanced yes/no question. Saying a letter carries 4.1 bits of entropy means pinning it down takes, on average, about four such questions. Where it comes from: Entropy & the Guessing Game.
context window: How much preceding text a model is allowed to see when predicting what comes next — one word for a simple Markov chain, two for a trigram model, hundreds of thousands of tokens for a modern assistant. Most of this journey is the story of making the usable context longer without the statistics falling apart.
corpus: The body of text a technique learns from — here, Shakespeare's 154 sonnets and Elizabeth Barrett Browning's 44 (corpora/). Latin for "body"; plural corpora, a word linguists refuse to give up.
cosine similarity: A score for how similar two vectors are in direction, ignoring their length: 1 means pointing the same way, 0 means at right angles (unrelated), and −1 means pointing in opposite directions. It is the standard way to ask "how alike are these two words?" once words have become vectors. In action: Co-occurrence Word Vectors.
embedding: A learned vector that stands for a word (or token), placed so that words used in similar ways end up near each other. Unlike a co-occurrence count, an embedding is learned — the network adjusts it during training until it is useful for prediction. Born at Neural LM, famous at Word2Vec.
entropy: The average surprise of a source, measured in bits: how unpredictable the next letter or word really is, given what you know. It is a hard limit on prediction: no model can be less surprised, on average, than the entropy of what it models. The full story: Entropy & the Guessing Game.
epoch (and iteration): One full pass through the training data; an iteration is one update step, usually on a small slice of it. "Trained for 15 epochs" means the network saw every training example 15 times.
gradient descent (and gradient): The gradient says, for each parameter, how steeply the current error would rise if that parameter were nudged upward; gradient descent is simply taking a small step the opposite way, downhill, millions of times. Nearly everything neural in this museum, and every frontier model, is trained this way.
held-out data: Text deliberately kept away from training so that a model can be graded on material it has never seen, the only honest test of whether it learned patterns rather than memorising examples. This repository holds out every 5th sonnet (Naive Bayes, the scoreboard); a wide gap between training and held-out scores is the classic symptom of memorisation.
hidden state: A vector a recurrent network carries with it as it reads, updated after every token — its working memory of everything so far. The word "hidden" just means it is internal: not input, not output. Watch one in motion: RNN.
loss: The single number training tries to shrink: how wrong the model's predictions currently are, averaged over the training data. For language models the loss is essentially surprise — low loss means the actual next word rarely surprises the model — which is why it converts directly into perplexity.
n-gram: A run of n consecutive tokens: "shall i" is a bigram (n = 2), "shall i compare" a trigram (n = 3). Counting which n-grams occur, and what follows them, was the workhorse of statistical language modelling for half a century. See N-gram Markov Chain.
parameter: One learned number inside a model — a knob training is allowed to turn. This repository's neural LM has about 17,000; a frontier model has hundreds of billions, which is what "large" in large language model counts.
perplexity: A model's average uncertainty, expressed as an effective number of choices: perplexity 50 means that, on average, the model is as unsure as if it were picking among 50 equally likely words. It is entropy un-logged, and lower is better. The journey's scoreboard ranks every generative stop by it.
probability distribution: A full set of options with a probability for each, summing to exactly 1 — "the / 0.21, a / 0.09, my / 0.07, …" is a distribution over next words. Every language model in this museum, from weighted Markov chain to frontier model, is at heart a machine for producing one of these.
smoothing: The art of never letting a statistical model say "impossible": reserving a little probability for words and transitions that never appeared in training, because the training data is a sample, not the language. Without it, one unseen word makes a whole text's probability zero. Used throughout Naive Bayes and the scoreboard.
softmax: The standard recipe for turning a row of raw scores into a probability distribution: exponentiate each score (so bigger differences matter more), then divide by the total so everything sums to 1. It is the last step of nearly every neural language model — the moment scores become a bet on the next word.
token (and tokenization): The unit a model actually reads — in this repository usually a lowercased word, in modern models a subword chunk like fa + irest. Tokenization is the (surprisingly consequential) act of cutting raw text into these units. Why the word isn't the obvious unit: Byte Pair Encoding.
Transformer: The 2017 neural architecture built almost entirely out of attention layers, processing every token in parallel instead of one at a time. Every frontier model — GPT, Claude, Gemini, Llama — is a Transformer at heart. The concept page: The Transformer.
vector: A plain list of numbers — [0.3, −1.2, 0.7, …] — treated as a point (or an arrow) in a space with one dimension per number. The great trick of modern NLP is representing words, sentences, and documents as vectors, so that "similar meaning" becomes the measurable "nearby in space".
vocabulary (and <unk>): The fixed set of tokens a model knows; anything outside it is out-of-vocabulary and gets mapped to a single stand-in token, written <unk> for "unknown". Small vocabularies are a real handicap, one of the honest asterisks on the scoreboard, and subword tokenization (BPE) was adopted largely to abolish out-of-vocabulary words.
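
A few of the entries above (softmax, cosine similarity, entropy, perplexity) are small enough to sketch in code. What follows is a minimal illustration in plain Python, not this repository's own implementation: the function names are hypothetical and the inputs are made-up numbers. Perplexity is computed here from the probabilities a model assigned to the words that actually came next, which is one common way to measure it.

```python
import math


def softmax(scores):
    """Turn a row of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(a, b):
    """Similarity in direction of two equal-length vectors, ignoring length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def entropy_bits(distribution):
    """Average surprise, in bits, of a probability distribution."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)


def perplexity(assigned_probs):
    """Effective number of choices, given the probability the model
    assigned to each token that actually occurred (assumed all > 0)."""
    avg_surprise = -sum(math.log2(p) for p in assigned_probs) / len(assigned_probs)
    return 2 ** avg_surprise  # average surprise in bits, "un-logged"


# Illustrative inputs only.
print(softmax([2.0, 1.0, 0.1]))                    # three probabilities, largest first
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction: approximately 1
print(entropy_bits([0.5, 0.5]))                    # a fair coin: 1.0 bit
print(perplexity([1 / 50] * 10))                   # approximately 50
```

The last line mirrors the perplexity entry: a model that gives every actual next word a probability of 1 in 50 is, on average, as unsure as if it were picking among 50 equally likely words.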

References

The papers (and one novel-in-verse) behind the stops, in order of appearance on the timeline. None are required reading — every explainer stands alone — but each is where its idea entered the world.

A. A. Markov (1913). Analysis of letter sequences in Pushkin's Eugene Onegin — the original Markov chain. → Markov Chain
G. K. Zipf (1935, 1949). The Psycho-Biology of Language; Human Behavior and the Principle of Least Effort. → Zipf's Law
C. E. Shannon (1948). "A Mathematical Theory of Communication"; and (1951) "Prediction and Entropy of Printed English". → Entropy, and the n-gram idea behind the whole Markov family
V. I. Levenshtein (1965). "Binary codes capable of correcting deletions, insertions, and reversals". → Edit Distance
J. Weizenbaum (1966). "ELIZA — A Computer Program For the Study of Natural Language Communication Between Man And Machine". → ELIZA
A. Viterbi (1967). "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm". → HMM + Viterbi Tagger
K. Spärck Jones (1972). "A statistical interpretation of term specificity and its application in retrieval" (the origin of IDF); G. Salton et al. (1975). "A vector space model for automatic indexing". → TF-IDF
K. Church & P. Hanks (1990). "Word association norms, mutual information, and lexicography". → PMI
J. Elman (1990). "Finding structure in time"; S. Hochreiter & J. Schmidhuber (1997). "Long Short-Term Memory". → RNN
I. Witten & T. Bell (1991). "The zero-frequency problem" — the smoothing used by the scoreboard.
P. Gage (1994). "A New Algorithm for Data Compression"; R. Sennrich et al. (2016). "Neural Machine Translation of Rare Words with Subword Units". → Byte Pair Encoding
M. Sahami et al. (1998). "A Bayesian approach to filtering junk e-mail". → Naive Bayes
Y. Bengio et al. (2003). "A Neural Probabilistic Language Model". → Neural LM
T. Mikolov et al. (2013). "Efficient Estimation of Word Representations in Vector Space". → Word2Vec
I. Sutskever et al. (2014). "Sequence to Sequence Learning with Neural Networks"; D. Bahdanau et al. (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". → seq2seq, Attention
A. Vaswani et al. (2017). "Attention Is All You Need". → The Transformer
J. Devlin et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"; A. Radford et al. (2018). "Improving Language Understanding by Generative Pre-Training". → Pretraining
J. Kaplan et al. (2020). "Scaling Laws for Neural Language Models"; T. Brown et al. (2020). "Language Models are Few-Shot Learners"; J. Hoffmann et al. (2022). "Training Compute-Optimal Large Language Models". → Scaling Laws
P. Lewis et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". → RAG
L. Ouyang et al. (2022). "Training language models to follow instructions with human feedback"; Y. Bai et al. (2022). "Constitutional AI: Harmlessness from AI Feedback". → Alignment