2014 · Encoder–Decoder Machine Translation
Squeeze a whole sentence into one vector — then watch that vector become the problem.
modern/. Those are too big to train on a laptop; this one predates that problem entirely — a small seq2seq translator would fit in this repo's zero-dependency rules just fine. What it needs instead is a parallel corpus: paired sentences in two languages, aligned to each other. This repo ships two English sonnet collections, not a translation dataset, so there is nothing honest to train here. What follows is the architecture and the specific problem it ran into — the problem that motivated attention.
rnn/ showed a recurrent network reading one sequence and carrying a hidden state forward. Machine translation needs something more: read one sequence (the source sentence) and produce a different sequence (the target sentence, in another language, almost never the same length). Sutskever, Vinyals, and Le (2014), alongside Cho et al. (2014), solved this with two RNNs back to back:
An encoder RNN reads the source sentence one word at a time, the same way rnn/ does, and keeps only its final hidden state — a single fixed-size vector standing in for the meaning of the entire sentence. A decoder RNN then starts from that one vector and generates the target sentence one word at a time, feeding each word it produces back in as the next step's input.
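The two-RNN pipeline described above can be sketched in a few dozen lines of standard-library Python. This is a minimal illustration, not a trained translator: the weights are random, a plain tanh RNN cell stands in for the LSTM and gated cells the original papers used, and every name here (`rnn_step`, `encode`, `decode`, `init_params`, the vocabulary sizes, the `bos`/`eos` ids) is an assumption made for the example.

```python
import math
import random

def rnn_step(W_x, W_h, x, h):
    """One vanilla RNN step: h_new = tanh(W_x @ x + W_h @ h)."""
    return [
        math.tanh(
            sum(wx * xi for wx, xi in zip(W_x[j], x))
            + sum(wh * hi for wh, hi in zip(W_h[j], h))
        )
        for j in range(len(h))
    ]

def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def one_hot(i, n):
    return [1.0 if k == i else 0.0 for k in range(n)]

def init_params(seed=0, src_vocab=12, tgt_vocab=9, hidden=8):
    # Hypothetical sizes; random weights, so the output is meaningless as translation.
    rng = random.Random(seed)
    return {
        "hidden": hidden, "src_vocab": src_vocab, "tgt_vocab": tgt_vocab,
        "enc_Wx": rand_matrix(rng, hidden, src_vocab),
        "enc_Wh": rand_matrix(rng, hidden, hidden),
        "dec_Wx": rand_matrix(rng, hidden, tgt_vocab),
        "dec_Wh": rand_matrix(rng, hidden, hidden),
        "dec_Wout": rand_matrix(rng, tgt_vocab, hidden),
    }

def encode(src_ids, p):
    """Read the source left to right and keep only the final hidden state."""
    h = [0.0] * p["hidden"]
    for tok in src_ids:
        h = rnn_step(p["enc_Wx"], p["enc_Wh"], one_hot(tok, p["src_vocab"]), h)
    return h  # the single fixed-size vector handed to the decoder

def decode(context, p, bos=0, eos=1, max_len=10):
    """Greedy decoding: start from the context vector, feed each output back in."""
    h, tok, out = context, bos, []
    for _ in range(max_len):
        h = rnn_step(p["dec_Wx"], p["dec_Wh"], one_hot(tok, p["tgt_vocab"]), h)
        logits = [sum(w * hi for w, hi in zip(row, h)) for row in p["dec_Wout"]]
        tok = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if tok == eos:
            break
        out.append(tok)
    return out

params = init_params()
short_ctx = encode([2, 3, 4], params)                           # 3 source tokens
long_ctx = encode([(i % 10) + 2 for i in range(30)], params)    # 30 source tokens
target_ids = decode(short_ctx, params)
```

Note that `short_ctx` and `long_ctx` have the same length (`hidden`) even though one source is ten times longer than the other. The decoder never sees anything but that one vector.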
Top: three source words compress into one fixed-size vector; the whole decoder output is generated from that single point. Bottom: whether the source sentence is 3 words or 30, the vector between encoder and decoder is the same fixed size — it cannot grow to hold more.
For short sentences this works surprisingly well. But the fixed-size vector is a hard ceiling: cramming an entire sentence's meaning into one point of fixed dimensionality is inherently lossy, and the longer the sentence, the more gets thrown away. Translation quality on this architecture degrades sharply as sentences get longer, and the loss is not graceful: it tracks the information the fixed-size vector cannot hold. This is the bottleneck, and it is the single problem the next four years of NLP research organized around solving.
Bahdanau, Cho, and Bengio (2014) asked a direct question: why should the decoder rely on only the encoder's final state, when every intermediate state the encoder produced while reading is still sitting right there? Their fix, attention in its original form — what this repo's attention/ page calls cross-attention — lets the decoder, at every generation step, look back across all of the encoder's hidden states and compute a weighted blend of them, with weights learned to reflect which source words are relevant to the word being generated right now.
Every decoder step gets its own weighted mixture of all encoder states — the darker the line, the more that source word contributes to that target word. Nothing is squeezed through a single fixed point anymore.
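Written out, the weighted blend for one decoder step looks roughly like the following. The symbols are assumptions chosen for this sketch: h_i is the encoder hidden state at source position i, s_{t-1} is the decoder state before generating target word t, and a(·,·) is a learned scoring function. In Bahdanau et al. that scoring function is a small feed-forward network, shown on the first line.

```latex
\begin{aligned}
e_{t,i} &= a(s_{t-1}, h_i) = v_a^{\top} \tanh\!\left(W_a s_{t-1} + U_a h_i\right)
  && \text{relevance of source position } i \text{ to step } t \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{k} \exp(e_{t,k})}
  && \text{softmax over all source positions} \\
c_t &= \sum_{i} \alpha_{t,i}\, h_i
  && \text{context vector for step } t
\end{aligned}
```

The context vector c_t is recomputed at every decoder step, so its contents can change from one target word to the next while its size stays fixed.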
This is the same softmax-weighted-average mechanic as this repo's attention/ page, just pointed at a different sequence: the decoder's query looks across the encoder's keys and values, rather than across other positions in its own sequence (self-attention). Removing the bottleneck this way produced a large, immediate jump in translation quality, especially on long sentences, and it planted the idea that would outgrow the RNN entirely.
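A minimal sketch of one cross-attention step, assuming dot-product scoring for brevity (Bahdanau et al. scored with a small learned network instead) and using the encoder states as both keys and values. The function names and the toy vectors are hypothetical, chosen only to make the mechanic visible.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, encoder_states):
    """One decoder step: score every encoder state, then blend them.

    query          -- decoder-side vector for the current step
    encoder_states -- one vector per source position (keys and values)
    """
    # Dot-product score between the query and each encoder state (assumption).
    scores = [sum(q * k for q, k in zip(query, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Weighted average of the encoder states, one dimension at a time.
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Toy example: three source positions, two-dimensional states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = cross_attention(query, encoder_states)
```

With this query, the first and third source positions score equally and both outweigh the second, so the context vector leans toward them. A different query at the next decoder step would produce a different blend from the same encoder states.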
rnn/) give a network persistent memory over a sequence.
The bottleneck is structural, not a training bug. No amount of extra training data fixes a fixed-size vector's information limit: the architecture itself caps how much of a long sentence can survive to the decoder.
Still sequential. Both the encoder and decoder are RNNs: token t cannot be processed until token t−1 is done, so the work within a sequence cannot be parallelized across time steps. Attention fixed what information reaches the decoder; it did nothing about the step-by-step computation that makes RNNs slow to train. That problem is what the Transformer removes, by dropping recurrence entirely.
Attention here is only cross-attention. Bahdanau's fix lets the decoder attend over the encoder. It does not let encoder positions attend to each other, or decoder positions attend to each other — that generalization, self-attention, is Vaswani et al.'s 2017 move, covered on this repo's attention/ page.