1948 / 1951 · Information Theory

Entropy & the Guessing Game

How many bits of surprise is a letter of English actually worth?

How it works

Entropy measures the average number of yes/no questions (bits) it takes, under the best questioning strategy, to pin down a value drawn from some distribution. A fair coin flip is 1 bit: one well-chosen question resolves it. A fair six-sided die is log2(6) ≈ 2.58 bits. Claude Shannon's 1948 formula:

H  =  − Σ p(x) · log2( p(x) )
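As a quick check on the formula, here is a minimal Python sketch that evaluates it for the coin and die mentioned above. The function name entropy_bits and the biased-coin example are illustrative choices, not something taken from this page's code.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a distribution given as a list of probabilities."""
    # Zero-probability outcomes are skipped: p * log2(p) tends to 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
fair_die = [1 / 6] * 6
biased_coin = [0.9, 0.1]  # hypothetical example: a coin that lands heads 90% of the time

print(entropy_bits(fair_coin))             # 1.0
print(round(entropy_bits(fair_die), 2))    # 2.58
print(round(entropy_bits(biased_coin), 3)) # 0.469
```

The biased coin carries less than half a bit per flip: the more predictable the source, the lower the entropy.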

Applied to English: if you had to guess the next letter of a sonnet, how many bits of uncertainty are you facing? The answer shrinks the more of the preceding text you're allowed to see. Zero-order entropy uses only overall letter frequency; first-order uses the one character before — after "q", "u" is nearly certain. This is exactly the assumption a bigram Markov chain makes, restated as an amount of uncertainty rather than a lookup table.
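One way to turn "entropy given the previous k characters" into code is a plug-in estimate: count how often each character follows each context, then average the surprise. The sketch below is an assumption about how such a number could be computed, not the code behind this page; conditional_entropy_bits and the toy sample string are hypothetical.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy_bits(text, order):
    """Plug-in estimate of H(next char | previous `order` chars), in bits.

    order=0 uses no context and reduces to plain letter-frequency entropy.
    """
    followers_by_context = defaultdict(Counter)
    for i in range(order, len(text)):
        followers_by_context[text[i - order:i]][text[i]] += 1

    total = len(text) - order  # number of (context, next char) pairs counted
    h = 0.0
    for followers in followers_by_context.values():
        n_context = sum(followers.values())
        for count in followers.values():
            # Weight by p(context, char); the surprise is -log2 p(char | context).
            h -= (count / total) * math.log2(count / n_context)
    return h

# Toy text (not the sonnets): every "q" is followed by "u".
sample = "the quick queen quietly quit the quiz"
for k in (0, 1, 2):
    print(k, round(conditional_entropy_bits(sample, k), 3))
```

On a text this short the higher-order estimates collapse toward zero simply because most contexts occur only once, so the figures are only meaningful on a corpus of realistic size.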

order 0 (no context): 4.071 bits
order 1 (prev. char): 3.263 bits
order 2 (prev. 2 chars): 2.556 bits

Entropy measured over Shakespeare's 154 sonnets (90,380 characters, 27-symbol alphabet: a–z + space). Each extra character of context recovers real bits: 0.808 saved by knowing one previous letter, another 0.706 by knowing two. 2^H converts entropy into an "effective number of choices" (16.8 at order 0, down to 9.6 at order 1), the same move neural-lm/'s perplexity makes one level up, at the word level.
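The conversion from bits to an effective number of choices is just 2 raised to the entropy. A small sketch, using the three figures reported above; the helper name effective_choices is illustrative.

```python
def effective_choices(entropy_bits):
    """Perplexity: the size of a uniform choice with the same entropy."""
    return 2 ** entropy_bits

# Entropies in bits per character, copied from the figures above.
measured = {0: 4.071, 1: 3.263, 2: 2.556}
for order, h in measured.items():
    print(f"order {order}: {effective_choices(h):.1f} effective choices")
```

By the same arithmetic, the order-2 figure of 2.556 bits corresponds to roughly 5.9 equally likely letters.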

Try it: cover the line, guess the letter

Shannon's 1951 guessing game, on a real line

Where it falls short

Entropy from finite text is optimistic. The order-2 table behind this page has only 430 distinct two-character contexts (of 27² = 729 possible) drawn from 90,380 characters. Some were seen only once or twice, so their measured "uncertainty" is artificially low. Treat higher-order numbers as likely underestimates of the true entropy (rough lower bounds), not precise measurements.
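The downward bias is easy to see on a source whose true entropy is known. The sketch below is a synthetic illustration, not the sonnet data: it draws letters uniformly from a 27-symbol alphabet (true entropy log2(27) ≈ 4.755 bits) and measures entropy from samples of different sizes. The seed and sample sizes are arbitrary choices.

```python
import math
import random
from collections import Counter

def plugin_entropy_bits(samples):
    """Entropy, in bits, of the empirical frequencies in `samples`."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

alphabet = "abcdefghijklmnopqrstuvwxyz "    # 27 symbols, as on this page
true_entropy = math.log2(len(alphabet))     # uniform source, so entropy is log2(27)

rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
for n in (10, 100, 10_000):
    sample = rng.choices(alphabet, k=n)
    print(n, round(plugin_entropy_bits(sample), 3), "measured vs", round(true_entropy, 3), "true")
```

With only 10 samples the measured value cannot exceed log2(10) ≈ 3.32 bits, however random the source really is. Rarely seen contexts in an order-2 table are in the same position.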

Character entropy isn't meaning. Bits per character measure how compressible the symbol stream is, not whether the text makes sense. A model can have excellent (low) entropy while generating grammatical nonsense.

This uses a perfect frequency table, not a human. Shannon's original 1951 experiment used real human guessers and derived formal entropy bounds from their guess-rank distribution. This demo substitutes an exhaustive count for the human — simpler to reproduce, but a different estimator with different numbers.
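To make the substitution concrete, here is a minimal sketch of a table-driven guesser: it builds follower counts from a training text, then, for each position in a line, guesses letters from most to least frequent and records how many guesses it needed. This is an assumed reconstruction for illustration, not the code behind this page; build_table, guess_ranks and the toy strings are hypothetical.

```python
from collections import Counter, defaultdict

def build_table(text, order=1):
    """Count which characters follow each `order`-character context."""
    table = defaultdict(Counter)
    for i in range(order, len(text)):
        table[text[i - order:i]][text[i]] += 1
    return table

def guess_ranks(line, table, alphabet, order=1):
    """Number of guesses needed at each position when guessing by frequency.

    Every character of `line` must appear in `alphabet`.
    """
    ranks = []
    for i in range(order, len(line)):
        followers = table.get(line[i - order:i], Counter())
        # Most frequent follower first; ties and unseen letters fall back to alphabet order.
        guesses = sorted(alphabet, key=lambda ch: (-followers[ch], alphabet.index(ch)))
        ranks.append(guesses.index(line[i]) + 1)
    return ranks

alphabet = "abcdefghijklmnopqrstuvwxyz "
training = "the quick queen quietly quit the quiz"  # toy stand-in for a corpus
table = build_table(training, order=1)
print(guess_ranks("quit the quiz", table, alphabet, order=1))
```

A rank of 1 means the first guess was correct. The distribution of these ranks is the quantity Shannon's 1951 bounds are derived from, with a human supplying the guesses instead of a count table.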