1990 / 1997 · Sequence Models with Memory

Recurrent Neural Network

Give the network a memory, and let it read one symbol at a time.

How it works

The feedforward neural language model could only see a fixed window of previous words. A recurrent network removes that limit: it reads a sequence one symbol at a time, carrying a hidden state that is updated at every step — a running memory of everything seen so far. A prediction can, in principle, depend on something many steps back.

At each step the network blends the current input character with its previous hidden state, then predicts the next character. The middle term is the whole trick — the past flowing into the present. One notational shortcut first: a matrix (each W) times a vector just means "mix the numbers together with learned weights" — so the formula below reads as new memory = squash(a mix of the current input, plus a mix of the previous memory):

h_t = tanh( W_xh·x_t + W_hh·h_{t-1} + b_h )
y_t = softmax( W_hy·h_t + b_y )
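To make the two formulas concrete, here is a minimal sketch of a single step, run over a short string. It is not the demo's actual code: the toy sizes (a 4-symbol vocabulary, 3 hidden units), the random initial weights, and the helper names `matvec`, `rand_matrix` and `rnn_step` are all assumptions made for illustration, and plain Python lists are used so it runs without any dependencies.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    """One step: new memory from input and old memory, then a prediction."""
    from_input = matvec(Wxh, x)        # mix of the current input
    from_past = matvec(Whh, h_prev)    # mix of the previous memory
    h = [math.tanh(a + b + c) for a, b, c in zip(from_input, from_past, bh)]
    y = softmax([s + c for s, c in zip(matvec(Why, h), by)])
    return h, y

def rand_matrix(rows, cols, rng, scale=0.01):
    """Small random weights (hypothetical initialisation)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

# Toy sizes, assumed for illustration only.
vocab = "abc "
V, H = len(vocab), 3
rng = random.Random(0)
Wxh, Whh, Why = rand_matrix(H, V, rng), rand_matrix(H, H, rng), rand_matrix(V, H, rng)
bh, by = [0.0] * H, [0.0] * V

h = [0.0] * H  # empty memory before the first character
for ch in "ab c":
    x = [1.0 if c == ch else 0.0 for c in vocab]  # one-hot encoding of the character
    h, y = rnn_step(x, h, Wxh, Whh, Why, bh, by)
```

After the loop, `h` is the memory left behind by the whole string and `y` is the (untrained) distribution over the next character. Note that each call needs the `h` returned by the previous one.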

Training unrolls the network into one layer per character and pushes gradients backward through all of them, a procedure called backpropagation through time. Each backward step multiplies the gradient by the recurrent weight matrix again, and that repeated multiplication can make gradients explode, so they are clipped (capped at a maximum size before each update). Weights are updated with Adagrad, which shrinks the learning rate for parameters whose gradients have already been large, so the updates settle down instead of overshooting.
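The clipping and Adagrad steps can be sketched in a few lines. The specifics here are assumptions rather than the demo's settings: gradients are clipped elementwise to ±5 (clipping by total norm is another common choice), and the learning rate of 0.1 and the names `clip` and `adagrad_update` are illustrative.

```python
import math

def clip(grads, limit=5.0):
    """Cap each gradient component to the range [-limit, limit] (assumed limit)."""
    return [max(-limit, min(limit, g)) for g in grads]

def adagrad_update(params, grads, cache, lr=0.1, eps=1e-8):
    """Update params in place; cache holds the running sum of squared gradients."""
    for i, g in enumerate(grads):
        cache[i] += g * g
        # Parameters with a large gradient history get a smaller effective step.
        params[i] -= lr * g / math.sqrt(cache[i] + eps)

params = [0.5, -0.3]
cache = [0.0, 0.0]
grads = clip([120.0, -0.2])  # the exploding first component is capped at 5.0

before = params[0]
adagrad_update(params, grads, cache)
first_step = before - params[0]

before = params[0]
adagrad_update(params, grads, cache)
second_step = before - params[0]
```

With the same gradient applied twice, `second_step` comes out smaller than `first_step`: the accumulated squared gradient in `cache` has grown, so the update shrinks.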

This demo is character-level: it is never told what a "word" is. Trained on the sonnets for 5,000 steps, the average loss per character falls from the uniform baseline log(28) ≈ 3.33 to 1.96 — and the samples go from noise to word-shaped text:
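As a quick check on those numbers, the sketch below computes the loss of a model that guesses uniformly over 28 symbols, and converts the reported final loss back into more intuitive quantities. Only the figures 28 and 1.96 come from the text; the variable names are illustrative.

```python
import math

vocab_size = 28  # from the text; which 28 symbols are used is not stated
baseline_loss = -math.log(1.0 / vocab_size)  # cross-entropy of a uniform guess, in nats

final_loss = 1.96  # reported average loss per character after training
# Geometric-mean probability the trained model assigns to the correct character.
avg_correct_p = math.exp(-final_loss)
# Perplexity: the number of equally likely choices that would give the same loss.
perplexity = math.exp(final_loss)
```

Read this way, training takes the model from choosing among 28 equally likely characters to the equivalent of choosing among roughly 7.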

before training (iteration 0): wcabyncmcrsovw abypvdhsxgmiyarctymjtbybnzrgnsqdntjs cyjnoyvvbpgyzijonaelahflwt…
after 5,000 iterations: bauths and bo pyorney by if lovk if hand ove douglk all at hice thus bite is all the for ofnoor sid thind t unl dsor houf to gour wastiln cofong dell ay youvost…

It has discovered that letters cluster into chunks separated by spaces, and many chunks — and, by, if, hand, all at, thus, is all the for, to — are real words. A character n-gram model of similar size cannot do this; the RNN's hidden state lets it remember how far into a word it is.

[Figure: colour scale from negative through ~0 to positive]

The first 4 of 64 hidden units (rows) as the trained network reads the phrase "shall i compare" (columns), one character at a time. Each cell is a real tanh activation in −1…1. The state lurches at word boundaries (the spaces) and re-settles inside each word — this changing 64-number vector is the network's memory.

Try it

Train the RNN live — watch it learn to spell

Starts as pure noise; within a couple of thousand iterations it discovers letters cluster into word-shaped chunks separated by spaces. This trains in your browser tab — small by necessity.

Where it falls short

It forgets. In theory the hidden state can remember arbitrarily far back; in practice a vanilla RNN's gradients vanish as they propagate backward through many tanh steps, so its effective memory is short. The LSTM (1997) added gates to fix exactly this, and it — not the vanilla RNN — powered 2010s NLP.
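The vanishing effect is easy to see in a one-unit toy version. The sketch below assumes a scalar RNN, h_t = tanh(w·h_{t-1} + x_t), with an arbitrary recurrent weight of 0.9 and a constant input of 0.5; these values and the function name `gradient_through_time` are illustrative, not taken from the demo. It tracks how strongly the state at step t still depends on the starting state.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_t / d h_0| after each step of a scalar tanh RNN."""
    h, grad, trace = h0, 1.0, []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * tanh'(a), and tanh'(a) = 1 - tanh(a)^2.
        grad *= w * (1.0 - h * h)
        trace.append(abs(grad))
    return trace

# Assumed toy settings: recurrent weight 0.9, the same input at all 50 steps.
trace = gradient_through_time(w=0.9, inputs=[0.5] * 50)
```

Every factor in the product is smaller than 1 here, so `trace` shrinks at each step and is vanishingly small by step 50: a learning signal from 50 characters back has almost no influence on the weights.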

It is strictly sequential. Step t cannot be computed before step t−1, so training cannot be parallelised across the sequence. This is the speed bottleneck the Transformer later removed.

One vector for the whole past. Squeezing an entire sentence into a single fixed-size hidden state is lossy — the "encode a sentence into one vector" pattern is precisely what the attention mechanism was invented to repair.