early 1990s · Distributional Semantics
Turn each word into the company it keeps — then measure meaning as an angle.
Every technique so far counted words in isolation or in pairs. Word vectors take the next step: represent each word as a vector describing all the company it keeps. Slide a small symmetric window across the text and, for each target word, tally how often every other word falls inside that window. Stack those tallies and a word becomes a point in a high-dimensional space whose axes are the other words of the vocabulary.
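The windowed tally described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the figures in this section: the function name `cooccurrence_vectors`, the whitespace tokenisation, and the toy sentence are all assumptions made for the example. The defaults mirror the ±3 window and top-200 vocabulary used later in the text.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, window=3, vocab_size=200):
    """Build one raw-count context vector per vocabulary word."""
    # The vocabulary (and therefore the axes) is the most frequent words.
    vocab = [w for w, _ in Counter(tokens).most_common(vocab_size)]
    in_vocab = set(vocab)
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        if target not in in_vocab:
            continue
        # Symmetric window: up to `window` tokens on each side of the target.
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in in_vocab:
                counts[target][tokens[j]] += 1
    # Stack the tallies: one dimension per vocabulary word, in vocab order.
    vectors = {w: [counts[w][c] for c in vocab] for w in vocab}
    return vocab, vectors

# Toy input (hypothetical); a real run would tokenise the whole corpus.
tokens = "the cat sat on the mat".split()
vocab, vectors = cooccurrence_vectors(tokens, window=1)
print(vocab)
print(vectors["the"])
```

The sketch treats the text as one flat token list, so windows run across line breaks; whether the original experiment did the same is not stated here.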
The justification is the distributional hypothesis: words used in similar contexts tend to mean similar things. The linguist J. R. Firth gave it its famous slogan:
You shall know a word by the company it keeps. — J. R. Firth, 1957
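To turn "similar company" into a number, compare the direction of two vectors with cosine similarity: the cosine of the angle between them. It ignores vector length and looks only at direction, so a rare word and a common word can still count as close if they keep the same kind of company. A cosine of 1 means the two context profiles point in exactly the same direction (the same proportions of neighbours, whatever the totals); 0 means the two words share no context words at all. Because counts are never negative, the score always falls between 0 and 1.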
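As a small sketch of that measure, the function below computes the cosine as the dot product divided by the product of the two vector lengths. The name `cosine`, the zero-vector convention, and the three toy vectors are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

rare = [1, 2, 0]       # a rare word: few co-occurrences
common = [10, 20, 0]   # a common word with the same proportions
disjoint = [0, 0, 5]   # a word that shares no context with the others

print(cosine(rare, common))    # close to 1: same direction, different length
print(cosine(rare, disjoint))  # 0.0: no shared context words
```

The first pair shows why vector length does not matter: scaling every count by ten leaves the angle unchanged.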
Real cosine similarities from Shakespeare's sonnets (top 200 words, ±3 window). For "heart" the method works: mind, sight, thoughts — the other inner faculties — rise to the top. For "beauty" it mostly surfaces function words (part, in, with): raw counts let ubiquitous words dominate. Faded bars are noise; the cosines also cluster tightly (0.71–0.76), a sign the small corpus barely separates them.
The same comparison, stated as a contrast: cos("heart","mind") = 0.848 while cos("heart","time") = 0.583. The related pair sits measurably closer in the 200-dimensional space than the unrelated pair — and nothing about that judgement required a dictionary or a single human label, only co-occurrence counts.
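High-dimensional and sparse. Every vector has one dimension per vocabulary word, and most cells are zero because most word pairs never co-occur. The representation is wasteful: each new vocabulary word adds both a vector and a dimension, so the full table grows with the square of the vocabulary size.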
Raw counts let frequent words dominate. There is no PMI weighting, so "and", "is", and "the" co-occur with everything and inflate every similarity. That is why the neighbours of "beauty" are noisy. The companion PMI subproject weights each co-occurrence by how surprising it is; here we vectorise the raw counts.
The vocabulary is fixed. Only the top 200 words get vectors. Rare words — which carry most of the distinguishing signal — have no vector at all.
One vector per word, so polysemy is unresolved. "bank" (river) and "bank" (money) collapse into a single vector that sums the contexts of both senses, weighted by how often each sense occurs, and is faithful to neither.
A small corpus gives noisy estimates. The sonnets are only ~17,600 tokens, so many counts are 0, 1, or 2 and a single coincidence can swing a score. Distributional methods only stabilise on large corpora.
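To make the raw-count limitation above concrete, here is a hedged sketch of positive PMI (PPMI) reweighting, one common way to weight a co-occurrence by how surprising it is. It is not the companion PMI subproject's implementation; the function name `ppmi` and the toy counts are invented for illustration. Rows are target words and columns are context words.

```python
import math

def ppmi(matrix):
    """Reweight a count matrix (list of rows) with positive PMI, base 2."""
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    weighted = []
    for i, row in enumerate(matrix):
        new_row = []
        for j, count in enumerate(row):
            if count == 0:
                new_row.append(0.0)  # never co-occurred: weight stays zero
                continue
            # PMI = log2( P(w, c) / (P(w) * P(c)) ), written with raw counts.
            pmi = math.log2(count * total / (row_sums[i] * col_sums[j]))
            new_row.append(max(0.0, pmi))  # clip negative values to zero
        weighted.append(new_row)
    return weighted

# Hypothetical counts; columns are the contexts "the", "love", "eye".
counts = [
    [8, 4, 0],  # target "heart"
    [8, 0, 4],  # target "mind"
]
weighted = ppmi(counts)
print(weighted)
```

In these toy counts the only context the two targets share is "the", which alone gives their raw vectors a cosine of 64/80 = 0.8. After reweighting, "the" occurs exactly as often as chance predicts, so its weight drops to zero and the inflated similarity disappears.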