early 1990s · Distributional Semantics

Count-Based Word Vectors

Turn each word into the company it keeps — then measure meaning as an angle.

How it works

Every technique so far counted words in isolation or in pairs. Word vectors take the next step: represent each word as a vector describing all the company it keeps. Slide a small symmetric window across the text and, for each target word, tally how often every other word falls inside that window. Stack those tallies and a word becomes a point in a high-dimensional space whose axes are the other words of the vocabulary.
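The windowed tally described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the figures on this page: the function name `cooccurrence_vectors`, its parameters, and the toy sentence are all assumptions made for the example.

```python
from collections import Counter


def cooccurrence_vectors(tokens, window=3, vocab_size=200):
    """Return (vocab, vectors): one count vector per vocabulary word.

    Hypothetical helper for illustration. vectors[w][i] is how often
    vocab[i] appeared within +/- `window` positions of w.
    """
    # Keep only the most frequent words, used both as targets and as axes.
    vocab = [word for word, _ in Counter(tokens).most_common(vocab_size)]
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {word: [0] * len(vocab) for word in vocab}

    for pos, target in enumerate(tokens):
        if target not in index:
            continue
        start = max(0, pos - window)
        stop = min(len(tokens), pos + window + 1)
        for ctx_pos in range(start, stop):
            if ctx_pos == pos:
                continue  # a word is not its own context
            context = tokens[ctx_pos]
            if context in index:
                vectors[target][index[context]] += 1
    return vocab, vectors


# Toy input, invented for the example.
tokens = "my heart and my mind love my heart".split()
vocab, vectors = cooccurrence_vectors(tokens, window=1, vocab_size=10)
print(dict(zip(vocab, vectors["heart"])))
```

With a window of 1, "heart" is counted twice next to "my" and once next to "and". Because the window is symmetric, the count of "my" in the vector for "heart" equals the count of "heart" in the vector for "my".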

The justification is the distributional hypothesis: words used in similar contexts tend to mean similar things. The linguist J. R. Firth gave it its famous slogan:

You shall know a word by the company it keeps. — J. R. Firth, 1957

To turn "similar company" into a number, compare the direction of two vectors with cosine similarity, the cosine of the angle between them. It ignores vector length and looks only at direction, so a rare word and a common word can still count as close if they keep the same kind of company. Because counts are never negative, the score lies between 0 and 1: a cosine of 1 means the two context profiles are proportional (the same mix of neighbours), and 0 means the words share no context words at all.

cos(a, b) = (a · b) / (|a| × |b|)

That is, the dot product of the two vectors divided by the product of their lengths.

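The formula translates directly into code. The sketch below uses only the standard library; returning 0.0 for an all-zero vector is a convention assumed here, since the formula itself is undefined in that case, and the three example vectors are invented.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for a word with no counted context
    return dot / (norm_a * norm_b)


# Invented vectors: `common` is `rare` scaled by ten, `disjoint` shares no axis.
rare = [1, 2, 0]
common = [10, 20, 0]
disjoint = [0, 0, 5]

print(cosine(rare, common))    # very close to 1: same direction, different length
print(cosine(rare, disjoint))  # 0.0: no shared context
```

The first pair shows the length invariance described above: scaling every count by the same factor leaves the score unchanged.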
Nearest to "heart" (word, cosine)

love        0.857
mind        0.848
sight       0.834
thoughts    0.818
loves       0.810
and         0.792
is          0.780

Nearest to "beauty" (word, cosine)

part        0.762
in          0.759
proud       0.751
will        0.737
with        0.727
and         0.719
truth       0.715

Real cosine similarities from Shakespeare's sonnets (top 200 words, ±3 window). For "heart" the method works: mind, sight, and thoughts, the other inner faculties, rise to the top. For "beauty" it mostly surfaces function words (in, with, and): raw counts let ubiquitous words dominate, and those neighbours are noise. The "beauty" cosines also cluster tightly (0.71–0.76), a sign the small corpus barely separates them.

The same comparison, stated as a contrast: cos("heart","mind") = 0.848 while cos("heart","time") = 0.583. The related pair sits measurably closer in the 200-dimensional space than the unrelated pair — and nothing about that judgement required a dictionary or a single human label, only co-occurrence counts.
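Ranking neighbours is just this comparison repeated for every other word and sorted. The sketch below is a toy version under stated assumptions: the four-dimensional vectors are invented for the example and are not the sonnet counts, and `nearest` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity; 0.0 for an all-zero vector (assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(word, vectors, k=3):
    """Return the k words whose vectors are closest in angle to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Invented toy counts over four context axes.
vectors = {
    "heart": [4, 3, 0, 1],
    "mind": [3, 3, 0, 1],
    "time": [0, 1, 5, 2],
    "and": [2, 2, 2, 2],
}

for other, score in nearest("heart", vectors):
    print(f"{other:6s} {score:.3f}")
```

In this toy data "mind" ranks first and "time" last, while "and", which has the same count on every axis, lands in between.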

Where it falls short

High-dimensional and sparse. Every vector has one dimension per vocabulary word, and most cells are zero — most word pairs never co-occur. The representation is wasteful and grows with the vocabulary.
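One way to see the waste is to measure the share of zero cells and compare a dense vector with a mapping that stores only the non-zero counts. Both helpers below, `sparsity` and `to_sparse`, are hypothetical names for this sketch, and the vectors are invented.

```python
def sparsity(vectors):
    """Fraction of cells that are zero across all word vectors."""
    cells = [count for vec in vectors.values() for count in vec]
    return sum(1 for count in cells if count == 0) / len(cells)


def to_sparse(vec):
    """Keep only the non-zero counts, keyed by axis position."""
    return {i: count for i, count in enumerate(vec) if count != 0}


# Invented toy vectors over four context axes.
vectors = {
    "rose": [0, 0, 3, 0],
    "sword": [1, 0, 0, 0],
}

print(sparsity(vectors))           # 6 zero cells out of 8
print(to_sparse(vectors["rose"]))  # only axis 2 carries a count
```

The sparse form trims storage, but the number of axes still grows with the vocabulary.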

Raw counts let frequent words dominate. There is no pointwise mutual information (PMI) weighting, so "and", "is", and "the" co-occur with everything and inflate every similarity. That is why the neighbours of "beauty" are noisy. The companion PMI subproject weights each co-occurrence by how surprising it is; here we vectorise the raw counts.
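The inflation is easy to reproduce with made-up numbers. In the sketch below two words share no content-word context at all, yet their raw-count cosine is high because both co-occur heavily with "and" and "the". Dropping the function-word axes is used here only to expose the effect; it is not the PMI weighting mentioned above, and the counts and axis labels are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity; 0.0 for an all-zero vector (assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Invented counts. Axes: "and", "the", "rose", "sword".
word_a = [20, 18, 3, 0]
word_b = [19, 21, 0, 3]

full = cosine(word_a, word_b)
# Keep only the two content-word axes ("rose", "sword").
content_only = cosine(word_a[2:], word_b[2:])

print(f"all axes:      {full:.3f}")
print(f"content axes:  {content_only:.3f}")
```

With all four axes the score is above 0.9; with the function-word axes removed it drops to 0, which is the only part of the comparison that says anything about meaning.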

The vocabulary is fixed. Only the top 200 words get vectors. Rare words — which carry most of the distinguishing signal — have no vector at all.

One vector per word, so polysemy is unresolved. "bank" (river) and "bank" (money) collapse into a single vector that pools the contexts of both senses, weighted by how often each sense occurs, and is faithful to neither.

A small corpus gives noisy estimates. The sonnets are only ~17,600 tokens, so many counts are 0, 1, or 2 and a single coincidence can swing a score. Distributional methods only stabilise on large corpora.