2013 · Learned Embeddings
Stop counting context. Predict it, and keep the classifier's weights.
word-vectors/ represents a word as a row of raw co-occurrence counts. Word2Vec keeps the same distributional hypothesis but changes how the vector is produced: it trains a tiny classifier to predict context words from a target word, and the classifier's own weights become the word's vector.
Skip-gram with negative sampling turns this into a question a simple binary classifier can answer: for a given (target, context) pair, is this real or fake? Every word gets two small vectors, started as random noise. For each real pair pulled from a sliding window, the model pushes the pair's similarity toward 1 (the vectors move closer); for a handful of random fake pairs, it pushes similarity toward 0 (the vectors move apart).
Repeat millions of times and a word's vector drifts toward wherever the words it actually keeps company with cluster — no count table required. "Negative sampling" is what makes this cheap: score against one real neighbour and a few random decoys, not the entire vocabulary, at every step.
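The update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not this page's actual implementation: the names `sgns_step`, `w_in`, `w_out`, and `random_vectors`, the learning rate, and the use of a sigmoid over the dot product as the "similarity" are all assumptions made for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_vectors(vocab_size, dim, rng, scale=0.1):
    # Every word starts as small random noise (hypothetical init scale).
    return [[rng.uniform(-scale, scale) for _ in range(dim)]
            for _ in range(vocab_size)]

def sgns_step(w_in, w_out, target, context, negatives, lr=0.05):
    """One skip-gram negative-sampling update, applied in place.

    w_in[i]  is word i's vector when it is the target;
    w_out[i] is word i's vector when it is the context.
    """
    v = list(w_in[target])              # snapshot of the target vector
    grad_v = [0.0] * len(v)
    # Label 1.0 for the real pair, 0.0 for each random decoy.
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = w_out[word]
        err = sigmoid(dot(v, u)) - label    # gradient of the log loss w.r.t. the score
        for d in range(len(v)):
            grad_v[d] += err * u[d]         # accumulate before u[d] changes
            u[d] -= lr * err * v[d]         # move the context (or decoy) vector
    # Move the target vector once, using the accumulated gradient.
    w_in[target] = [x - lr * g for x, g in zip(v, grad_v)]
```

Each call touches only one target row and a handful of context rows, which is the cost saving negative sampling buys: the work per step depends on the number of decoys, not on the size of the vocabulary.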
Nearest neighbours of "heart," same corpus, same 200-word vocabulary. Both lists agree on the meaningful ones: sight, thoughts, loves. But the counted list lets a function word ("and") leak in, because "and" co-occurs with everything; the learned list doesn't, because "and" turns up next to "heart" no more reliably than next to any other word, so the classifier gains nothing by pulling those two vectors together.
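A neighbour list like the one above can be produced by ranking every other word by its similarity to the query word. The sketch below assumes cosine similarity, which is a common choice but not one this page states; `nearest_neighbours` and the dict-of-lists layout are hypothetical names for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(vectors, word, k=3):
    """vectors maps word -> list of floats. Returns the k most similar other words."""
    query = vectors[word]
    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: cosine(query, vectors[w]), reverse=True)[:k]
```

The same function works for counted and learned vectors alike, since both are just rows of numbers; only the way the rows were produced differs.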
Word2Vec's most famous result is vector arithmetic: king − man + woman ≈ queen. This corpus doesn't have enough of those words. The best substitute is its most frequent gendered pair — pronouns. With this page's defaults (±5-word window, 80 epochs, seed 1), his − he + she lands closest to her. But the result does not survive small changes to those settings — a narrower window or a different epoch count can flip the winner entirely. The trick is real on this corpus; it just isn't stable, which is exactly why the famous version of this demo needed a corpus a million times larger to look effortless every time.
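The arithmetic itself is short to write down. The following is a sketch under assumptions: it ranks candidates by cosine similarity to the combined vector and leaves the three query words out of the ranking, a common convention that this page may or may not follow. The function name `analogy` and the data layout are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, a, b, c, k=1):
    """Rank words by cosine similarity to vec(a) - vec(b) + vec(c).

    vectors maps word -> list of floats. The query words a, b, c are
    excluded from the candidates (an assumed convention).
    """
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)[:k]
```

Called as `analogy(vectors, "his", "he", "she")`, it returns the single closest remaining word. Because the winner is decided by small differences between similarity scores, retraining with another window or epoch count can reorder the top of the list.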
One vector per word, still. As in word-vectors/, "light" the noun and "light" the adjective share one blended vector. Learning that vector instead of counting it does nothing to fix polysemy.
Fragile on a small corpus. The analogy result flips under small changes to window size or training length; see above.
No interpretable dimensions. A count vector's columns are literal context words you can read. A learned vector's 16 numbers are wherever gradient descent happened to put them.
Needs many training pairs relative to parameters. This demo uses tens of thousands of (target, context) pairs to fit 6,400 parameters (200 words × 2 vectors × 16 dimensions), and the ratio needed only grows as vocabulary and dimensionality increase.
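The pairs-versus-parameters comparison is easy to reproduce. The sketch below assumes pairs are collected from a symmetric sliding window, as described earlier, and that each word carries two vectors of equal size; `skipgram_pairs` and `parameter_count` are hypothetical helper names, not functions from this page.

```python
def skipgram_pairs(tokens, window=5):
    """All (target, context) pairs within ±window positions of each token."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def parameter_count(vocab_size, dim):
    # Two vectors per word: one used as target, one used as context.
    return 2 * vocab_size * dim
```

With a 200-word vocabulary and 16 dimensions, `parameter_count(200, 16)` gives the 6,400 figure quoted above, and `len(skipgram_pairs(tokens))` gives the number of real pairs a tokenized corpus supplies per epoch.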