1990s · Supervised Learning

Naive Bayes

The first method here that needs an answer key — and the math behind early spam filters.

How it works

Every technique before this one was unsupervised: count words, rank them, weight them, generate from them — never told what the right answer is. Naive Bayes is the turn toward supervised learning. We hand it labeled examples (“this sonnet is Shakespeare, that one is Browning”) and it learns to tell them apart. The same machinery filtered spam from email throughout the 1990s.

Bayes' theorem flips a question we can't answer directly — what is the probability this document belongs to class C? — into one we can estimate from training counts: how likely is this wording under each class? Keeping only the part that varies between classes:

P(class | doc)  ∝  P(class) · ∏ P(word | class)

We compute that score for each class and pick the larger. The method is called naive because the product quietly assumes every word is independent of every other given the class. That is plainly false (“summer” and “day” co-occur), yet it works remarkably well for choosing a winner. To keep an unseen word from zeroing out the whole product, we add Laplace (add-1) smoothing: P(word|class) = (count + 1) / (total + V), where count is how often the word appears in that class's training text, total is the number of word tokens in that class, and V is the vocabulary size. All arithmetic runs in log-space, summing log-probabilities instead of multiplying probabilities, to avoid underflow.
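The scoring rule, the add-1 smoothing, and the log-space arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the code behind this page: the function names (`train`, `log_scores`, `classify`) and the two toy "documents" are assumptions made up for the example.

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (label, tokens) pairs. Returns the raw counts."""
    doc_counts = Counter()   # documents per class, used for the prior P(class)
    word_counts = {}         # class -> Counter of word occurrences
    vocab = set()            # every distinct word seen in training
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}

def log_scores(model, tokens):
    """log P(class) + sum of log P(word | class), with add-1 smoothing."""
    n_docs = sum(model["doc_counts"].values())
    V = len(model["vocab"])
    scores = {}
    for label, counts in model["word_counts"].items():
        total = sum(counts.values())  # word tokens in this class
        score = math.log(model["doc_counts"][label] / n_docs)
        for word in tokens:
            # A Counter returns 0 for unseen words, so the +1 keeps log() finite.
            score += math.log((counts[word] + 1) / (total + V))
        scores[label] = score
    return scores

def classify(model, tokens):
    scores = log_scores(model, tokens)
    return max(scores, key=scores.get)

# Toy training set (invented for illustration, not real sonnets).
toy = [
    ("shakespeare", "thou doth love thy beauty".split()),
    ("browning", "beloved i love thee between".split()),
]
model = train(toy)
print(classify(model, "doth thy beauty".split()))
print(classify(model, "beloved between".split()))
```

Because the scores are sums of logs, a long document only makes the numbers more negative; it never underflows to zero the way a raw product of small probabilities would.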

Word      P(word | Shakespeare)   P(word | Browning)
your      4.39e-3                 1.33e-4
doth      3.99e-3                 2.67e-4
beauty    2.57e-3                 1.33e-4
love      7.64e-3                 8.66e-3
beloved   1.14e-4                 1.60e-3
between   5.70e-5                 1.07e-3
drop      5.70e-5                 1.33e-3

Real smoothed P(word | class) from training on 160 sonnets. Each word adds log P(word|class) to that class's running score, so “your” and “doth” shove the decision toward Shakespeare while “beloved” and “between” pull toward Browning. “love” is nearly balanced (logLR ≈ −0.13): both poets lean on it, so it decides nothing.
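To see how much each word moves the decision, we can take the log of the ratio of the two probabilities quoted above. A small sketch: the sign convention (positive favours Shakespeare, negative favours Browning) is inferred from the −0.13 quoted for “love”, and the variable names are illustrative.

```python
import math

# (P(word | Shakespeare), P(word | Browning)), copied from the values above.
probs = {
    "your":    (4.39e-3, 1.33e-4),
    "doth":    (3.99e-3, 2.67e-4),
    "beauty":  (2.57e-3, 1.33e-4),
    "love":    (7.64e-3, 8.66e-3),
    "beloved": (1.14e-4, 1.60e-3),
    "between": (5.70e-5, 1.07e-3),
    "drop":    (5.70e-5, 1.33e-3),
}

# Log-likelihood ratio: > 0 pushes toward Shakespeare, < 0 toward Browning.
log_lr = {word: math.log(p_s / p_b) for word, (p_s, p_b) in probs.items()}

for word, value in log_lr.items():
    leaning = "Shakespeare" if value > 0 else "Browning"
    print(f"{word:8s} {value:+.2f}  leans {leaning}")
```

The value for “love” rounds to −0.13, matching the figure, while the other six words all sit more than two log-units away from zero.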

Run on the sonnets, the classifier holds out every 5th poem of each author (38 test sonnets), trains on the other 160, and gets 36 of 38 right — 94.7% accuracy. The two misses are both Browning poems light on her tell-tale words. The model never learned anything about poetry; it just learned which words each author over-uses.
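The split and the accuracy figure can be checked with a few lines of arithmetic. This sketch assumes the corpus sizes given later in this section (154 Shakespeare, 44 Browning) and assumes that "every 5th poem" means positions 5, 10, 15, and so on within each author; that offset is a guess, chosen because it reproduces the 160/38 split. The poem strings are placeholders.

```python
def split_every_fifth(poems):
    """Hold out the 5th, 10th, 15th, ... poem for testing; train on the rest."""
    train, test = [], []
    for i, poem in enumerate(poems):
        (test if i % 5 == 4 else train).append(poem)
    return train, test

# Placeholder poems: only the counts matter for this check.
shakespeare = [f"S{i}" for i in range(154)]
browning = [f"B{i}" for i in range(44)]

s_train, s_test = split_every_fifth(shakespeare)
b_train, b_test = split_every_fifth(browning)

n_train = len(s_train) + len(b_train)
n_test = len(s_test) + len(b_test)
accuracy = 36 / n_test  # 36 correct predictions, as reported above

print(n_train, n_test, f"{accuracy:.1%}")
```

Splitting within each author, rather than across the pooled corpus, keeps both classes represented in the test set despite the imbalance.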

Try it

Classify a line: Shakespeare or Browning?

The classifier was trained on every sonnet of each author. It has no idea what the words mean — only how their frequencies differ between the two writers.

Where it falls short

The independence assumption is false. Words are correlated (“thou art”, “summer day”), so the classifier double-counts evidence. It still picks the right class often, but its probability estimates are pushed toward 0 or 1, far more confident than the evidence warrants.
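A tiny two-class calculation shows why double-counting inflates confidence without changing the winner. The evidence value of 1.0 is a hypothetical log-likelihood ratio chosen for illustration, and equal priors are assumed.

```python
import math

def posterior_from_log_odds(log_odds):
    """P(class A | doc) for two classes, given the log-odds in favour of A."""
    return 1.0 / (1.0 + math.exp(-log_odds))

evidence = 1.0  # hypothetical log-likelihood ratio contributed by one word

# One word's worth of evidence.
once = posterior_from_log_odds(evidence)

# A perfectly correlated partner word adds no new information,
# yet Naive Bayes adds its log-ratio to the score all over again.
twice = posterior_from_log_odds(2 * evidence)

# Ten such redundant words push the posterior very close to 1.
ten = posterior_from_log_odds(10 * evidence)

print(f"{once:.3f} {twice:.3f} {ten:.5f}")
```

All three posteriors sit above 0.5, so the predicted class never changes; only the stated certainty grows.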

Bag of words discards order. “love is not beauty” and “beauty is not love” receive identical scores. Syntax and negation are invisible.
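The order-blindness is easy to confirm: once each line is reduced to word counts, the two phrases are literally the same object. A minimal sketch, with `bag_of_words` as an illustrative helper name:

```python
from collections import Counter

def bag_of_words(text):
    """Reduce a line to word counts, discarding word order entirely."""
    return Counter(text.lower().split())

a = bag_of_words("love is not beauty")
b = bag_of_words("beauty is not love")

# Identical bags mean identical Naive Bayes scores for every class.
print(a == b)
```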

It needs labeled data. Unlike every earlier technique here, Naive Bayes can't start cold — someone must label the training documents first. That human cost is the price of supervision.

It is sensitive to class imbalance and smoothing. With 154 Shakespeare sonnets to 44 Browning, the prior tilts toward Shakespeare; both errors are Browning misread as Shakespeare. The add-1 constant silently fixes how much an unseen word costs, and on a small corpus, where real counts are tiny next to the vocabulary size, it can swamp the real signal.
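The size of the prior's head start can be put in the same log units as the word evidence. This sketch uses the full-corpus counts quoted above; the counts in the actual training split would differ slightly, so treat the result as approximate.

```python
import math

n_shakespeare, n_browning = 154, 44
total = n_shakespeare + n_browning

log_prior_s = math.log(n_shakespeare / total)
log_prior_b = math.log(n_browning / total)

# Shakespeare's head start before a single word is read: log(154 / 44) = log(3.5).
head_start = log_prior_s - log_prior_b

print(f"{head_start:.2f}")
```

A head start of about 1.25 log-units means a Browning poem needs at least that much word evidence in its favour before it can win, which is consistent with both misses being Browning poems light on her characteristic words.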

Frequent function words can dominate. Without care, ubiquitous words pile up tiny contributions that drown out the rare, genuinely indicative terms — the very problem TF-IDF was built to address.