2013 · Learned Embeddings

Word2Vec

Stop counting context. Predict it, and keep the classifier's weights.

How it works

word-vectors/ represents a word as a row of raw co-occurrence counts. Word2Vec keeps the same distributional hypothesis but changes how the vector is produced: it trains a tiny classifier to predict context words from a target word, and the classifier's own weights become the word's vector.

Skip-gram with negative sampling turns this into a question a simple binary classifier can answer: for a given (target, context) pair, is this real or fake? Every word gets two small vectors, one used when it is the target and one when it is the context, both started as random noise. For each real pair pulled from a sliding window, the model pushes the pair's similarity toward 1 (the vectors move closer); for a handful of random fake pairs, it pushes similarity toward 0 (the vectors move apart).

positive: (heart, sight) really co-occurred → push toward 1
negative: (heart, unto) paired at random → push toward 0
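
The two pushes above can be sketched as a single update rule. This is a minimal illustration, not this page's actual training code: it assumes the pair's score is the sigmoid of the two vectors' dot product and that the update is plain gradient descent on log loss. The learning rate, the initialisation range, and the names sgns_step, real_score, and fake_score are hypothetical.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))

def sgns_step(target_vec, context_vec, label, lr=0.05):
    """Nudge one (target, context) pair toward label 1 (real) or 0 (fake).

    Both vectors are modified in place. Returns the score before the update.
    """
    score = sigmoid(dot(target_vec, context_vec))
    grad = score - label  # gradient of log loss with respect to the dot product
    for i in range(len(target_vec)):
        t, c = target_vec[i], context_vec[i]
        target_vec[i] -= lr * grad * c
        context_vec[i] -= lr * grad * t
    return score

# Assumed setup: 16-dimensional vectors started as small random noise.
rng = random.Random(1)
dim = 16
heart = [rng.uniform(-0.5, 0.5) for _ in range(dim)]  # target vector
sight = [rng.uniform(-0.5, 0.5) for _ in range(dim)]  # context vector, real pair
unto = [rng.uniform(-0.5, 0.5) for _ in range(dim)]   # context vector, fake pair

for _ in range(200):
    sgns_step(heart, sight, label=1.0)  # positive: push toward 1
    sgns_step(heart, unto, label=0.0)   # negative: push toward 0

real_score = sigmoid(dot(heart, sight))
fake_score = sigmoid(dot(heart, unto))
print(f"real pair: {real_score:.3f}  fake pair: {fake_score:.3f}")
```

After a few hundred repetitions the real pair's score should sit above 0.5 and the fake pair's below it, which is all "closer" and "apart" mean here.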

Repeat millions of times and a word's vector drifts toward wherever the words it actually keeps company with cluster — no count table required. "Negative sampling" is what makes this cheap: score against one real neighbour and a few random decoys, not the entire vocabulary, at every step.
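
To make the cost argument concrete, here is a rough sketch of how the training examples could be produced: one real pair per neighbour inside the window, followed by a few random decoys. The function name training_examples, the toy sentence, and uniform decoy sampling are all assumptions for illustration; the page does not say how its decoys are drawn.

```python
import random

def training_examples(tokens, vocab, window=5, negatives=3, seed=1):
    """Yield (target, context, label) triples.

    Each real pair from the sliding window (label 1) is followed by
    `negatives` randomly chosen decoys (label 0).
    """
    rng = random.Random(seed)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a word is not its own context
            yield target, tokens[j], 1
            for _ in range(negatives):
                # Assumption: decoys are drawn uniformly from the vocabulary.
                # A decoy may occasionally be a genuine neighbour; this
                # sketch does not filter those out.
                yield target, rng.choice(vocab), 0

# Made-up toy sentence, not taken from the page's corpus.
tokens = "the heart and the eye see the sight".split()
vocab = sorted(set(tokens))

examples = list(training_examples(tokens, vocab, window=2, negatives=3))
real_pairs = [e for e in examples if e[2] == 1]

scored_per_step = 1 + 3  # one real neighbour plus three decoys
print(len(real_pairs), len(examples), scored_per_step, len(vocab))
```

Each real pair costs 1 + negatives classifier evaluations regardless of vocabulary size, whereas a full softmax would score every word in the vocabulary at every step.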

word2vec (learned)
loves      0.833
sight      0.770
eye        0.688
dear       0.688
thoughts   0.663

word-vectors (counted)
love       0.857
mind       0.848
sight      0.834
thoughts   0.818
and        0.792

Nearest neighbours of "heart," same corpus, same 200-word vocabulary. Both lists agree on the meaningful ones: sight, thoughts, and love/loves. But the counted list lets a function word ("and") leak in, because "and" co-occurs with everything; the learned list doesn't, because "and" turns up next to "heart" about as often as it turns up next to any other word, so the classifier gains nothing specific by pulling the two together.
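
A nearest-neighbour list like the ones above can be computed by ranking every other word by similarity to the query word. The sketch below assumes cosine similarity, which is the usual choice but is not confirmed by the page; the three-dimensional vectors are invented for illustration and are not the page's trained values.

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest(word, vectors, k=5):
    """Return the k words most similar to `word`, best first, as (word, score) pairs."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # a word is trivially its own nearest neighbour
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Invented toy vectors, for illustration only.
toy_vectors = {
    "heart": [1.0, 1.0, 0.0],
    "loves": [1.0, 0.9, 0.1],
    "sight": [0.6, 1.0, 0.0],
    "and": [-1.0, 0.2, 1.0],
}

top_two = nearest("heart", toy_vectors, k=2)
for neighbour, score in top_two:
    print(f"{neighbour:<8}{score:.3f}")
```

The same function works for both kinds of vector: pass it rows of counts or rows of learned weights and only the ranking changes.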

The analogy, honestly

Word2Vec's most famous result is vector arithmetic: king − man + woman ≈ queen. This corpus doesn't have enough of those words. The best substitute is its most frequent gendered pair — pronouns. With this page's defaults (±5-word window, 80 epochs, seed 1), his − he + she lands closest to her. But the result does not survive small changes to those settings — a narrower window or a different epoch count can flip the winner entirely. The trick is real on this corpus; it just isn't stable, which is exactly why the famous version of this demo needed a corpus a million times larger to look effortless every time.
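
The arithmetic itself is short. The sketch below builds the query vector his − he + she and returns the closest remaining word. It assumes cosine similarity and the common convention of excluding the three input words from the candidates; neither is confirmed by the page. The toy vectors are hand-built so that the analogy holds exactly, which trained vectors never do.

```python
import math

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Hand-built toy vectors; the made-up axes are [male, female, possessive].
toy_vectors = {
    "he": [1.0, 0.0, 0.0],
    "she": [0.0, 1.0, 0.0],
    "his": [1.0, 0.0, 1.0],
    "her": [0.0, 1.0, 1.0],
    "heart": [0.5, 0.5, 0.2],
}

winner = analogy("his", "he", "she", toy_vectors)
print(winner)
```

On trained vectors the winner is whichever candidate happens to lie nearest the query point, so a small shift in any of the three input vectors can hand the result to a different word. That is the instability described above.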

Where it falls short

One vector per word, still. Exactly like word-vectors/, "light" the noun and "light" the adjective share one blended vector. Learning it instead of counting it does nothing to fix polysemy.

Fragile on a small corpus. The analogy result does not survive small changes to window size or training length; see above.

No interpretable dimensions. A count vector's columns are literal context words you can read. A learned vector's 16 numbers are wherever gradient descent happened to put them.

Needs many training pairs relative to parameters. This demo uses tens of thousands of (target, context) pairs to fit 6,400 parameters (200 words × 2 vectors × 16 dimensions), and the ratio needed only grows with vocabulary and dimensionality.
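
The 6,400 figure follows from the numbers given earlier on this page: a 200-word vocabulary, two vectors per word, and 16 numbers per vector. A small sketch of that arithmetic, with a hypothetical helper name, shows how quickly the count scales:

```python
def sgns_parameter_count(vocab_size: int, dim: int, vectors_per_word: int = 2) -> int:
    """Number of trainable weights when every word has `vectors_per_word` vectors of size `dim`."""
    return vocab_size * dim * vectors_per_word

demo_params = sgns_parameter_count(vocab_size=200, dim=16)
print(demo_params)

# Doubling either the vocabulary or the dimensionality doubles the count.
print(sgns_parameter_count(vocab_size=400, dim=16))
print(sgns_parameter_count(vocab_size=200, dim=32))
```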