Neural Language Model — NLP Journey

Every technique before this one counts things and looks them up. A neural language model learns instead. Bengio and colleagues showed in 2003 that a small feedforward network, trained to predict the next word from the previous few, could reach lower perplexity than smoothed n-gram models and, as a side effect, discover useful word representations on its own.

Each word gets a dense embedding: a short vector of real numbers. The network looks up the embeddings of the two previous words, concatenates them, passes them through a tanh hidden layer, and produces a softmax probability over the whole vocabulary. Crucially, the embeddings are parameters — they are learned by backpropagation alongside everything else, so words used in similar contexts drift toward similar vectors.
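The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the page's actual code: the vocabulary size (200) and embedding width (24) come from the text, while the hidden width of 48, the random initial weights, and all names (`E`, `W1`, `W2`, `forward`) are assumptions made for the example.

```python
import math
import random

random.seed(0)

V, D = 200, 24      # vocabulary size and embedding width, as stated in the text
H = 48              # hidden width: an assumption for this sketch
CONTEXT = 2         # the two previous words

def rand_matrix(rows, cols, scale=0.1):
    """Small random weights; a real run would learn these by backpropagation."""
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(V, D)               # embedding table: one row per word
W1 = rand_matrix(H, CONTEXT * D)    # hidden layer weights
b1 = [0.0] * H
W2 = rand_matrix(V, H)              # output layer weights
b2 = [0.0] * V

def forward(prev2, prev1):
    """Return a probability distribution over the next word id."""
    x = E[prev2] + E[prev1]         # list concatenation: a vector of length 2 * D
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    logits = [sum(w * hi for w, hi in zip(row, h)) + b
              for row, b in zip(W2, b2)]
    m = max(logits)                 # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward(3, 17)              # arbitrary word ids, for illustration only
print(len(probs), round(sum(probs), 6))
```

With untrained weights the output is close to uniform; training is what moves probability mass toward the words that actually follow each context.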

Training nudges every weight to make the true next word more probable. As it does, a single number falls: the cross-entropy loss. Watching it drop is watching the model learn.
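To make the loss concrete, here is a small sketch of cross-entropy and its companion measure, perplexity, assuming the usual definitions (average negative natural-log probability of the true word, and its exponential). The function names and the toy probabilities are invented for the example.

```python
import math

def cross_entropy(true_word_probs):
    """Average negative log-probability the model gave to each true next word."""
    return -sum(math.log(p) for p in true_word_probs) / len(true_word_probs)

def perplexity(loss):
    """Exponential of the loss: the effective number of equally likely choices."""
    return math.exp(loss)

# Hypothetical probabilities assigned to the true next word on five examples.
uniform_guess = [1 / 200] * 5                  # an untrained model over 200 words
after_training = [0.05, 0.02, 0.10, 0.04, 0.03]

print(perplexity(cross_entropy(uniform_guess)))
print(perplexity(cross_entropy(after_training)))
```

A model that guesses uniformly over 200 words has a perplexity of about 200; anything lower means it has learned to rule some words out.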

Real training run on Shakespeare's 154 sonnets (200-word vocabulary, 4,353 trigrams, 16,952 trainable values). Each epoch the average loss falls and perplexity drops from 134 to 23 — the network is genuinely learning to predict, not memorising lookups.
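The figures in this caption can be sanity-checked with a little arithmetic. The sketch below assumes a plain architecture (embedding table, one tanh hidden layer with biases, one output layer with biases, no direct input-to-output connections) and a hidden width of 48. The hidden width is not stated in the text; it is simply the value that reproduces the quoted total under those assumptions. The loss values assume perplexity is the exponential of a natural-log cross-entropy.

```python
import math

def parameter_count(vocab, embed_dim, context, hidden):
    """Count trainable values for the assumed architecture."""
    embeddings = vocab * embed_dim                       # one vector per word
    hidden_layer = context * embed_dim * hidden + hidden # weights plus biases
    output_layer = hidden * vocab + vocab                # weights plus biases
    return embeddings + hidden_layer + output_layer

total = parameter_count(vocab=200, embed_dim=24, context=2, hidden=48)
print(total)                        # 16952

# Perplexity 134 and 23 expressed as average loss in nats.
start_loss = math.log(134)
end_loss = math.log(23)
print(round(start_loss, 2), round(end_loss, 2))   # 4.9 3.14
```

Note that the output layer (9,800 values) and the embedding table (4,800 values) account for most of the parameters; both scale with vocabulary size.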

thou → hath, thyself, therefore, gentle

my → thy, his, whose, times

Nearest neighbours in the learned 24-dimensional embedding space (cosine). The network was never told these words are related — it inferred it from how they are used.
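A nearest-neighbour lookup by cosine similarity might look like the following sketch. The three-dimensional vectors are made up purely for illustration (the real embeddings are 24-dimensional and learned), and the helper names `cosine` and `nearest` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings, k=4):
    """Return the k words whose vectors point most nearly the same way."""
    target = embeddings[word]
    scored = [(cosine(target, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

# Toy vectors, invented for this example; not the trained embeddings.
toy = {
    "thou": [0.9, 0.1, 0.0],
    "hath": [0.8, 0.2, 0.1],
    "my":   [0.0, 1.0, 0.1],
    "thy":  [0.1, 0.9, 0.0],
}

print(nearest("my", toy, k=1))      # ['thy']
print(nearest("thou", toy, k=1))    # ['hath']
```

Cosine similarity compares direction only, so a frequent word with a long vector does not dominate the ranking.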

Where it falls short

Fixed, tiny context. It still only sees the previous two words — the same fixed-window limitation as an n-gram model. Anything further back is invisible, so long-range dependencies cannot be captured.
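The fixed window is easiest to see in how training examples are built. The sketch below slides a three-word window over a token list and emits (previous-two, previous-one, target) id triples. Skipping windows that contain an out-of-vocabulary word is an assumption about preprocessing, not something the text states, and the function name is hypothetical.

```python
def trigram_examples(tokens, vocab):
    """Build (w1, w2, target) id triples from a sliding three-word window."""
    examples = []
    for i in range(2, len(tokens)):
        window = tokens[i - 2:i + 1]
        # Assumed policy: drop any window containing an unknown word.
        if all(w in vocab for w in window):
            examples.append(tuple(vocab[w] for w in window))
    return examples

tokens = "shall i compare thee to a summer's day".split()
vocab = {word: idx for idx, word in enumerate(tokens)}   # toy vocabulary

examples = trigram_examples(tokens, vocab)
print(len(examples))        # 6
print(examples[-1])         # (5, 6, 7): "a", "summer's" -> "day"
```

When the model predicts "day", its input is only the ids for "a" and "summer's"; "compare" and "thee" never reach the network.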

Small and slow. Pure-JS training caps us at a 200-word vocabulary and a few thousand examples; the sonnets are far too small a corpus to learn rich embeddings. Real neural LMs need orders of magnitude more data and compute.

The cost of a full softmax. Producing a probability over the entire vocabulary on every step is the expensive part, because the output layer's work grows linearly with vocabulary size. Later methods (hierarchical softmax, Word2Vec's negative sampling) were designed to avoid that cost.

No memory of the sentence. Each prediction is computed from scratch from its own two-word window; no state is carried from one position to the next, so nothing about earlier words survives.
