Zipf's Law — NLP Journey

Count how often each word appears in a body of text, then rank the words from most to least frequent. Zipf's law says a word's frequency is roughly inversely proportional to its rank: the most common word appears about twice as often as the second, three times as often as the third, and so on.
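The count-then-rank step can be sketched in a few lines of Python. This is a minimal illustration, not the page's own implementation: the function name `rank_frequency`, the tokenizing rule (lowercase runs of letters and apostrophes), and the toy sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) rows, most frequent word first."""
    # Assumption: lowercase the text and treat runs of letters and
    # apostrophes as tokens. Real tokenizers make other choices.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # most_common() sorts by count, highest first.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


# Toy text for illustration only; far too short to show a real Zipf curve.
sample = "the cat and the dog and the bird saw the cat"
for rank, word, count in rank_frequency(sample)[:3]:
    print(rank, word, count, rank * count)
```

On a real corpus, the last printed column is the rank × frequency product discussed next.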

A weaker way to say the same thing: rank × frequency moves far less than either rank or frequency alone. A word at rank 1 with frequency 490 gives a product of 490; a word at rank 50 with frequency ~54 gives a product of ~2700. The rank grew 50-fold and the frequency fell about 9-fold, yet the product rose only about 5.5-fold. That is a climb, not a constant, and poetry's heavy function words bend the low ranks especially hard. The real law shows up when you plot frequency against rank on logarithmic axes: that climbing, uneven product becomes a near-straight diagonal line. The line is the law; the "constant" is a rough intuition pump, not a precise claim.
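One way to check the "near-straight line" claim is to fit a least-squares line to log(frequency) against log(rank) and look at its slope. The sketch below is an assumption-laden illustration: `loglog_slope` is a hypothetical helper, and the input is synthetic data that follows frequency = 1000 / rank exactly, not the sonnet counts.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    # frequencies must be sorted from most to least frequent; rank starts at 1.
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic counts that follow frequency = 1000 / rank exactly (not real data).
ideal = [1000 / rank for rank in range(1, 101)]
print(round(loglog_slope(ideal), 3))
```

A perfectly Zipfian list gives a slope of -1. Real counts would be expected to land near that value rather than on it, and the fit is pulled around by the noisy tail, so treat the number as a rough diagnostic.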

Real word counts from Shakespeare's 154 sonnets (17,608 tokens, 3,170 unique words) on log-log axes. The orange curve tracks the dashed ideal closely. It sags slightly at the very top — poetry leans on "and" more than a pure power law predicts — and flattens into a long tail of words that appear just once.

This shape is why function words like "the", "and", and "of" are nearly useless for telling documents apart, and why the rare words in the tail carry almost all of the distinguishing signal. Every frequency-based technique that follows is, in some sense, a response to this curve.

Try it

Rank the words and watch rank × frequency climb slowly

The far-right column (rank × frequency) climbs far more slowly than the bars shrink, rather than exploding or staying flat — the signature of a Zipfian distribution. Switch authors: the same curve appears.

Where it falls short

It is descriptive, not useful on its own. Zipf's law tells you the distribution of word frequencies, but nothing about meaning, sequence, or relevance. It cannot generate text, classify a document, or answer a query.

It says nothing about what words mean. Two corpora can share an almost identical Zipf curve while being about completely different things. The law constrains the shape of the counts, not their content.

The constant is not truly constant. Real corpora deviate at both ends — the very top words overshoot, and the tail is noisier than the law predicts. Zipf's law is a strong tendency, not an exact equation, and refinements (Zipf–Mandelbrot) add parameters to fit the deviations.
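The Zipf–Mandelbrot refinement is usually written as frequency ∝ 1 / (rank + q)^s, where q shifts the ranks and s sets the steepness. A minimal sketch follows; the function name, the parameter names `scale`, `shift`, and `exponent`, and the numbers passed in are illustrative assumptions, not values fitted to any corpus.

```python
def zipf_mandelbrot(rank, scale, shift=0.0, exponent=1.0):
    """Predicted frequency: scale / (rank + shift) ** exponent."""
    return scale / (rank + shift) ** exponent


# With shift=0 and exponent=1 this reduces to plain Zipf: scale / rank.
print(zipf_mandelbrot(2, scale=490))  # 245.0

# A positive shift lowers the predicted head, so top ranks sit closer together.
print(zipf_mandelbrot(1, scale=490, shift=2.0))
```

The two extra parameters are what let the curve bend at the top instead of staying a single straight line on log-log axes.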
