TF-IDF — NLP Journey

TF-IDF is the first technique in this series that works at the document level rather than the word-sequence level. The goal is no longer to generate text — it is to rank documents by relevance to a query. The key insight is that a word appearing in every document tells you nothing about any particular one. Only words that are common in one document but rare across the collection are genuinely informative.

A TF-IDF score is the product of two numbers. Term Frequency (TF) measures how often a word appears in a document relative to that document's length. Inverse Document Frequency (IDF) is the natural log of the total number of documents divided by the number of documents that contain the word. A word appearing in every document gets IDF = log(1) = 0, no matter how often it appears locally.

TF = count of word in document ÷ total words in document

IDF = log( total documents ÷ documents containing word )

score = TF × IDF
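The three formulas translate almost directly into code. The following is a minimal sketch rather than a reference implementation: the function names, the whitespace tokenisation, and the three-line toy corpus are all assumptions made for illustration, and the zero returned for a word that appears in no document is a convenience choice, not part of the definition.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """Count of word in document divided by total words in document."""
    return Counter(doc_tokens)[word] / len(doc_tokens)

def idf(word, corpus):
    """Natural log of total documents over documents containing the word."""
    containing = sum(1 for doc in corpus if word in doc)
    # Assumption: a word found in no document gets IDF 0 instead of
    # triggering a division by zero.
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(word, doc_tokens, corpus):
    """score = TF x IDF"""
    return tf(word, doc_tokens) * idf(word, corpus)

# Hypothetical toy corpus, tokenised by splitting on whitespace.
corpus = [
    "the winter wind is cold".split(),
    "the summer sun is warm".split(),
    "the trenches of winter".split(),
]

print(tf_idf("the", corpus[2], corpus))       # "the" is in every document
print(tf_idf("trenches", corpus[2], corpus))  # "trenches" is in only one
```

In this toy corpus "the" occurs in all three documents, so its score is zero everywhere, while "trenches" keeps a positive score in the one document that contains it.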

"the": TF (frequency in Sonnet 2) = 0.034; IDF (in 154 of 154 documents) = 0.000; TF-IDF score = 0.000

"trenches": TF (frequency in Sonnet 2) = 0.0085; IDF (in 1 of 154 documents) = 5.037; TF-IDF score = 0.043

"the" appears four times as often as "trenches" in Sonnet 2, yet scores zero — because "the" appears in all 154 sonnets and therefore carries no information about which sonnet you are reading. "trenches" appears in only one sonnet; its rarity is the signal.
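The figures for the two words can be checked by hand. This short sketch takes the TF values and document counts quoted above as given and assumes the natural logarithm, which is the base that reproduces the quoted IDF of 5.037; the variable names are illustrative.

```python
import math

N_DOCS = 154  # number of sonnets in the collection

# IDF = log(total documents / documents containing the word)
idf_the = math.log(N_DOCS / 154)      # "the" is in all 154 sonnets
idf_trenches = math.log(N_DOCS / 1)   # "trenches" is in exactly one

# score = TF x IDF, using the TF values quoted for Sonnet 2
score_the = 0.034 * idf_the
score_trenches = 0.0085 * idf_trenches

print(round(idf_the, 3), round(score_the, 3))            # 0.0 0.0
print(round(idf_trenches, 3), round(score_trenches, 3))  # 5.037 0.043
```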

To search, compute TF-IDF scores for every word in every document, then for a query word look up its score in each document and rank by the result. The highest-scoring documents are the most relevant. Word order is irrelevant — the document is just a weighted bag of terms.
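The search procedure can be sketched in a few lines. The names `build_index` and `search`, the lowercase whitespace tokenisation, and the three toy documents are assumptions for illustration. Dropping documents that score zero is also a design choice made here, not something the method requires.

```python
import math
from collections import Counter

def build_index(docs):
    """Map each document name to a {word: TF-IDF score} dictionary."""
    tokenised = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenised)
    # Document frequency: the number of documents each word occurs in.
    df = Counter(word for tokens in tokenised.values() for word in set(tokens))
    index = {}
    for name, tokens in tokenised.items():
        counts = Counter(tokens)
        index[name] = {
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return index

def search(index, query_word):
    """Return (name, score) pairs with a non-zero score, best first."""
    hits = [(name, scores.get(query_word, 0.0)) for name, scores in index.items()]
    return sorted((h for h in hits if h[1] > 0), key=lambda h: h[1], reverse=True)

# Hypothetical toy collection.
docs = {
    "a": "the winter wind is cold",
    "b": "the summer sun is warm",
    "c": "winter winter the long winter",
}
index = build_index(docs)
results = search(index, "winter")
print([name for name, _ in results])  # ['c', 'a']
print(search(index, "the"))           # []
```

Document "c" ranks first because "winter" makes up a larger share of its words. A query for "the" returns nothing, since a word present in every document scores zero everywhere.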

Where it falls short

Word order is gone. "The dog bit the man" and "The man bit the dog" are identical documents to TF-IDF. The model has no concept of syntax, grammar, or sequence. This is a deliberate simplification that makes retrieval tractable, but it discards real information.
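The loss of word order is easy to confirm. In this small sketch the helper `bag_of_words` is a hypothetical name, and lowercasing plus whitespace splitting stands in for whatever tokeniser a real system would use.

```python
from collections import Counter

def bag_of_words(text):
    """Reduce a text to word counts, discarding order."""
    return Counter(text.lower().split())

a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")

print(a == b)  # True
```

Because the two bags are equal, any score computed from them is equal as well.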

Vocabulary mismatch. A query for "automobile" will not match a document about "cars". TF-IDF works on exact token overlap. Synonyms, morphological variants ("run" vs "running"), and paraphrases are invisible — each is a different token with its own independent score.
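A minimal sketch of exact token matching shows the problem. The helper `matches` and the example sentence are assumptions for illustration; the point is only that a comparison on whole tokens has no way to relate "automobile" to "cars" or "run" to "running".

```python
def matches(query, text):
    """True only if the query appears as an exact token in the text."""
    return query.lower() in text.lower().split()

doc = "She was running past the parked cars"

for query in ["cars", "automobile", "run"]:
    print(query, matches(query, doc))
```

Only "cars" matches. Stemming or synonym expansion can narrow the gap, but those are extra steps layered on top of TF-IDF rather than part of it.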

No negative information. TF-IDF can tell you that "winter" is important in a document. It cannot tell you that "summer" being absent might be equally informative. The model scores only presence, not absence.

Document length bias. Longer documents tend to have higher raw term frequencies. The TF normalisation helps but does not fully solve this — different normalisation schemes produce different rankings, and none is universally correct.
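The effect of the normalisation choice can be seen with two made-up documents. The three schemes below (raw count, relative frequency, and log-scaled count) are common variants rather than anything this article prescribes, and the counts are invented for illustration.

```python
import math

def raw_tf(count, length):
    """Raw count, no normalisation."""
    return count

def relative_tf(count, length):
    """Count divided by document length."""
    return count / length

def log_tf(count, length):
    """Log-scaled count; dampens repeated occurrences."""
    return 1 + math.log(count) if count else 0.0

# Hypothetical (count of "winter", total words) for two documents.
docs = {"short": (2, 10), "long": (5, 100)}

def rank(scheme):
    """Order document names by the chosen TF scheme, best first."""
    return sorted(docs, key=lambda name: scheme(*docs[name]), reverse=True)

for scheme in (raw_tf, relative_tf, log_tf):
    print(scheme.__name__, rank(scheme))
```

Raw and log-scaled counts put the long document first, while relative frequency prefers the short one. IDF is the same for both documents, so the TF scheme alone decides the ranking here.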

Context-free meaning. "Bank" in "river bank" and "bank" in "savings bank" get the same token and the same score. Polysemy — one word, multiple meanings — is invisible to any bag-of-words model.
