Levenshtein Edit Distance

This is the first technique in the series that drops below the word. Every method before it treated the word as an atom; edit distance looks inside, at the sequence of characters. The Levenshtein distance between two strings is the minimum number of single-character edits — insertions, deletions, and substitutions — needed to transform one into the other. "loue" becomes "love" with a single substitution, so their distance is 1.

Trying every possible sequence of edits is exponential, so Levenshtein's contribution was a dynamic-programming solution. You fill an (m+1) × (n+1) grid where each cell holds the edit distance between a prefix of the source and a prefix of the target. The first row and column are the base cases — building a prefix up from the empty string, one edit per character. Every other cell is the cheapest of three moves into it: delete, insert, or substitute (free if the two characters already match).

The answer to the whole problem lands in the bottom-right cell. Trace the cheapest path back to the top-left corner and you recover the actual alignment — which characters matched, which were edited.

		l	o	v	e
	0	1	2	3	4
l	1	0	1	2	3
o	2	1	0	1	2
u	3	2	1	1	2
e	4	3	2	2	1

optimal alignment path base cases the distance

The real matrix the program computes for "loue" → "love". The highlighted diagonal is the cheapest path: l and o match (cost stays 0), u vs v differ so the cost climbs by one substitution, then e matches and the cost holds. The bottom-right cell, 1, is the edit distance. Run as a spell-checker over Shakespeare's 3,170-word vocabulary, "loue" sits one edit from three real words at once — lose, loud, and love — which distance alone cannot tell apart. (The README's worked matrix uses loue → lose instead — the CLI's default, alphabetically-first pick among the three; both cost exactly 1 edit.)

Try it

Where it falls short

It sees surface form, never meaning. "car" and "automobile" are synonyms yet sit eight edits apart, while "cat" and "cot" are one edit apart and completely unrelated. Distance on letters is a poor proxy for distance in meaning — the very gap that distributional and embedding methods later exist to close.

It is O(m × n) per comparison. Filling the grid costs the product of the two string lengths, and a naive spell-check runs it once for every word in the vocabulary. Over a real dictionary of hundreds of thousands of words that linear scan is far too slow, so production systems reach for tries, BK-trees, or finite-state automata.

All edits cost the same. Substituting "u→v" — a classic OCR and early-modern-spelling confusion — costs exactly as much as "u→z". Real applications weight edits by keyboard adjacency, phonetic similarity, or observed typo frequency; plain Levenshtein has no way to express that one mistake is likelier than another.

How it works

Try it

Where it falls short