Count-Based Word Vectors

Every technique so far counted words in isolation or in pairs. Word vectors take the next step: represent each word as a vector describing all the company it keeps. Slide a small symmetric window across the text and, for each target word, tally how often every other word falls inside that window. Stack those tallies and a word becomes a point in a high-dimensional space whose axes are the other words of the vocabulary.
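The windowed tally described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function name cooccurrence_vectors, its parameters, and the toy sentence are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, window=3, vocab_size=200):
    """For each vocabulary word, count how often every other vocabulary
    word appears within `window` positions on either side of it."""
    # Keep only the most frequent words, as both targets and context axes.
    vocab = [w for w, _ in Counter(tokens).most_common(vocab_size)]
    index = set(vocab)
    counts = defaultdict(Counter)
    for pos, target in enumerate(tokens):
        if target not in index:
            continue
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            context = tokens[ctx_pos]
            # Skip the target position itself and out-of-vocabulary words.
            if ctx_pos != pos and context in index:
                counts[target][context] += 1
    # Turn each Counter into a dense list with one slot per vocabulary word.
    vectors = {w: [counts[w][c] for c in vocab] for w in vocab}
    return vocab, vectors

# Hypothetical toy text, only to show the shape of the result.
tokens = "my heart and my mind love the heart of my love".split()
vocab, vectors = cooccurrence_vectors(tokens, window=2, vocab_size=3)
for word in vocab:
    print(word, vectors[word])
```

Because the window is symmetric, the resulting count table is symmetric too: the count of "my" in the vector for "heart" equals the count of "heart" in the vector for "my".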

The justification is the distributional hypothesis: words used in similar contexts tend to mean similar things. The linguist J. R. Firth gave it its famous slogan:

"You shall know a word by the company it keeps."

To turn "similar company" into a number, compare the direction of two vectors with cosine similarity, the cosine of the angle between them. It ignores vector length and looks only at direction, so a rare word and a common word can still count as close if they keep the same kind of company. Because counts are never negative, the score lies between 0 and 1: a cosine of 1 means the two context profiles point in exactly the same direction (the same proportions of context words), and 0 means the two words share no context words at all.
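As a rough sketch of that computation, assuming plain Python lists as count vectors (the vectors below are invented for illustration, and returning 0.0 for an all-zero vector is a convention chosen here, not something the method prescribes):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a word with no contexts matches nothing
    return dot / (norm_u * norm_v)

rare = [1, 2, 0, 1]        # hypothetical rare word
common = [10, 20, 0, 10]   # same context profile, ten times as frequent
disjoint = [0, 0, 5, 0]    # shares no context word with `rare`

print(cosine(rare, common))    # very close to 1: length is ignored
print(cosine(rare, disjoint))  # 0: no shared contexts
```

Dividing by both norms is what removes the effect of raw frequency: scaling either vector by a constant leaves the score unchanged.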

Nearest to "heart" (cosine similarity):

love      0.857
mind      0.848
sight     0.834
thoughts  0.818
loves     0.810
and       0.792
is        0.780

Nearest to "beauty" (cosine similarity):

part      0.762
in        0.759
proud     0.751
will      0.737
with      0.727
and       0.719
truth     0.715

Real cosine similarities from Shakespeare's sonnets (top 200 words, ±3 window). For "heart" the method works: mind, sight, thoughts (the other inner faculties) rise to the top. For "beauty" it mostly surfaces function words (in, with, and): raw counts let ubiquitous words dominate, and those neighbours are noise. The "beauty" cosines also cluster tightly (0.71–0.76), a sign the small corpus barely separates them.

The same comparison, stated as a contrast: cos("heart","mind") = 0.848 while cos("heart","time") = 0.583. The related pair sits measurably closer in the 200-dimensional space than the unrelated pair — and nothing about that judgement required a dictionary or a single human label, only co-occurrence counts.
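A nearest-neighbour list like the ones above comes from scoring one word against every other word and sorting. The sketch below assumes the vectors are held in a plain dict. The helper name nearest and the four toy vectors are hypothetical; they are not the sonnet counts and will not reproduce the figures quoted here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def nearest(word, vectors, k=3):
    """Return the k words whose vectors have the highest cosine with `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Invented toy counts over four context axes.
vectors = {
    "heart": [4, 3, 0, 1],
    "mind":  [3, 3, 0, 1],
    "time":  [0, 1, 5, 2],
    "love":  [5, 2, 1, 0],
}

for neighbour, score in nearest("heart", vectors):
    print(f"{neighbour:6s} {score:.3f}")
```

No labels enter the ranking at any point; the order falls out of the counts alone.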

Where it falls short

High-dimensional and sparse. Every vector has one dimension per vocabulary word, and most cells are zero — most word pairs never co-occur. The representation is wasteful and grows with the vocabulary.
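One way to see the waste is to measure the share of zero cells, and to compare the dense layout with one that stores only the non-zero entries. The helper names and the toy vectors below are assumptions for illustration.

```python
def sparsity(vectors):
    """Fraction of cells, across all vectors, that are zero."""
    cells = [count for vec in vectors.values() for count in vec]
    return sum(1 for count in cells if count == 0) / len(cells)

def to_sparse(vec):
    """Keep only the non-zero cells, keyed by their axis index."""
    return {axis: count for axis, count in enumerate(vec) if count != 0}

# Invented toy vectors over a six-word vocabulary.
vectors = {
    "heart": [2, 0, 0, 1, 0, 0],
    "mind":  [0, 0, 3, 0, 0, 0],
    "time":  [0, 1, 0, 0, 0, 0],
}

print(f"zero cells: {sparsity(vectors):.0%}")
print(to_sparse(vectors["heart"]))
```

The sparse form saves memory, but it does not change the underlying problem: the number of axes still grows with the vocabulary.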

Raw counts let frequent words dominate. There is no PMI (pointwise mutual information) weighting, so "and", "is", and "the" co-occur with everything and inflate every similarity. That is why the neighbours of "beauty" are noisy. The companion PMI subproject weights each co-occurrence by how surprising it is; here we vectorise the raw counts.
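The inflation is easy to reproduce with made-up numbers. In the sketch below, two hypothetical words share nothing except a heavy count on the "and" axis. Dropping that axis is only a diagnostic for the example; it is not the PMI weighting mentioned above.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Axes: ["and", "love", "gold"]. All counts are invented.
word_a = [50, 3, 0]   # co-occurs with "and" and "love"
word_b = [50, 0, 3]   # co-occurs with "and" and "gold"

with_and = cosine(word_a, word_b)
without_and = cosine(word_a[1:], word_b[1:])  # ignore the "and" axis

print(f"with 'and' axis:    {with_and:.3f}")
print(f"without 'and' axis: {without_and:.3f}")
```

With the shared function word included the two words look nearly identical; without it they have nothing in common. A single ubiquitous axis carries almost the whole score.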

The vocabulary is fixed. Only the top 200 words get vectors. Rare words — which carry most of the distinguishing signal — have no vector at all.

One vector per word, so polysemy is unresolved. "bank" (river) and "bank" (money) collapse into a single vector that blends both senses in proportion to how often each occurs, and is faithful to neither.
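A toy calculation makes the blending concrete. Suppose, purely as an assumption for illustration, that the contexts of each sense could be counted separately. The corpus only ever sees the surface form "bank", so the observed vector is the cell-by-cell sum of the two.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Axes: ["water", "shore", "loan", "cash"]. All counts are invented.
bank_river = [6, 4, 0, 0]   # hypothetical counts for the river sense
bank_money = [0, 0, 9, 7]   # hypothetical counts for the money sense

# What a count-based model actually stores: one vector for the string "bank".
bank = [r + m for r, m in zip(bank_river, bank_money)]

river_score = cosine(bank, bank_river)
money_score = cosine(bank, bank_money)
print(f"bank vs river sense: {river_score:.3f}")
print(f"bank vs money sense: {money_score:.3f}")
```

The blended vector matches neither sense exactly, and it leans toward whichever sense is more frequent in the corpus.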

A small corpus gives noisy estimates. The sonnets are only ~17,600 tokens, so many counts are 0, 1, or 2 and a single coincidence can swing a score. Distributional methods only stabilise on large corpora.
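The effect of a single coincidence can be shown with invented counts: add one extra co-occurrence to a vector and see how far the cosine moves, first when the counts are tiny and then when they are a hundred times larger. The helper name swing is hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def swing(u, v, axis):
    """Change in cosine when v gains one extra co-occurrence on `axis`."""
    bumped = list(v)
    bumped[axis] += 1
    return cosine(u, bumped) - cosine(u, v)

# Same context profiles at two corpus sizes (all counts invented).
small_u, small_v = [1, 0, 2], [0, 1, 2]
large_u, large_v = [100, 0, 200], [0, 100, 200]

small_swing = swing(small_u, small_v, axis=0)
large_swing = swing(large_u, large_v, axis=0)
print(f"small counts: cosine moves by {small_swing:.3f}")
print(f"large counts: cosine moves by {large_swing:.3f}")
```

The same one-off event shifts the small-count score far more than the large-count score, which is the sense in which the estimates only stabilise with more text.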
