The Transformer — NLP Journey

Concept page · no code to run

Every earlier stop on this journey shipped a script you could run on the corpus in this repo. This one does not. The Transformer is defined by scale (data, compute, and model size far beyond a laptop), so the machinery only becomes useful when trained on internet-scale text. What follows is the architecture itself, not a runnable demo.

How it works

In 2017, Vaswani et al.'s paper "Attention Is All You Need" took self-attention — the routing primitive from the previous chapter — and assembled it into a complete, trainable machine. Multi-head self-attention does the routing, running several attention "heads" in parallel so the model can attend to different kinds of relationships at once. But attention alone is not enough, so the Transformer adds three more ingredients.

First, positional encodings: raw attention is order-blind — it sees a set, not a sequence — so a signal encoding each token's position is added to its embedding at the input. Second, a feed-forward layer applied independently to each position, giving the model room to transform each token's representation. Third, residual connections with layer normalization wrapping each sublayer, which let gradients flow cleanly and make it possible to stack dozens of these blocks and still train them stably. Stack that block N times and you have a Transformer.
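To make the routing step concrete, here is a minimal sketch of multi-head self-attention in plain Python. It is a toy under stated assumptions, not the paper's implementation: the weight matrices are random stand-ins for learned parameters, `random_matrix` is a hypothetical helper, and lists of rows are used instead of an array library to keep the example dependency-free. Nothing here is trained.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical helper: random values standing in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head_self_attention(x, num_heads, rng):
    """Run several heads on the same input, then concatenate and project."""
    d_model = len(x[0])
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads
    heads = []
    for _head in range(num_heads):
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        out, _weights = attention_head(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        heads.append(out)
    # Concatenate the heads position by position, then mix them with W_o.
    concat = [[v for head in heads for v in head[i]] for i in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o)


rng = random.Random(0)
tokens = random_matrix(4, 8, rng)  # 4 positions, d_model = 8
output = multi_head_self_attention(tokens, num_heads=2, rng=rng)
print(len(output), len(output[0]))  # same shape as the input
```

Each head gets its own projections, which is what lets different heads attend to different kinds of relationships over the same tokens.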

The single most consequential decision was what they removed: recurrence. Where an LSTM must process token 1 before it can touch token 2, a Transformer processes every position of a training sequence in parallel. (Generation still emits one token at a time, but each step attends over the whole context at once.) That one engineering fact, full parallelism across the sequence during training, is what made training on internet-scale text economically possible, and it is why nearly every frontier model today is a Transformer or a close descendant.

RNN / LSTM — sequential

tok₁ → tok₂ → tok₃ → tok₄

Each step waits for the one before it. Token 4 cannot start until token 3 is done.

Transformer — parallel

tok₁ tok₂ tok₃ tok₄

All positions are processed at once in a single pass. This is the fact that made internet-scale training affordable.
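The contrast can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: tokens are single numbers rather than vectors, the weights `w_in` and `w_rec` are made-up constants, and the attention-style function has no learned parameters. The point is only the data dependency: the recurrent loop needs the previous state, while each attention-style output depends on the raw inputs alone and can be computed in any order.

```python
import math


def recurrent_states(tokens, w_in=0.5, w_rec=0.9):
    """RNN-style pass: every state depends on the state before it."""
    state = 0.0
    states = []
    for tok in tokens:
        # This line cannot run for step t until step t-1 has finished.
        state = math.tanh(w_in * tok + w_rec * state)
        states.append(state)
    return states


def attention_style_output(tokens, position):
    """Output for one position, computed from the inputs only."""
    query = tokens[position]
    scores = [query * key for key in tokens]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e / total * value for e, value in zip(exps, tokens))


tokens = [0.1, -0.4, 0.7, 0.2]

# Positions computed left to right.
in_order = [attention_style_output(tokens, i) for i in range(len(tokens))]

# Positions computed right to left, then put back in reading order.
out_of_order = [attention_style_output(tokens, i) for i in reversed(range(len(tokens)))]
out_of_order.reverse()

print(in_order == out_of_order)  # order of computation does not matter
print(recurrent_states(tokens))  # this loop has no such freedom
```

Because no position waits on another, the per-position work can be spread across parallel hardware instead of being run as a chain.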

One Transformer block: positional encodings are added to token embeddings at the input, then data flows through multi-head self-attention, Add & Norm, a feed-forward layer, and Add & Norm again, with each sublayer wrapped in a residual connection (drawn dashed in the figure). That block is stacked N times. Recurrence is gone, so the whole sequence moves through in parallel rather than token by token.
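The wiring of that block can be sketched as follows. This is a simplified illustration, not a faithful implementation: the attention sublayer is single-head with no learned projections (queries, keys, and values are all the input itself), layer normalization omits its learned gain and bias, the feed-forward weights are random stand-ins, and the input is assumed to already include positional encodings.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(a, b):
    """Element-wise sum, used for the residual connection."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def layer_norm(x, eps=1e-5):
    """Normalize each position to zero mean and unit variance (no gain or bias)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def self_attention(x):
    """Simplified single-head attention with Q = K = V = x."""
    scale = math.sqrt(len(x[0]))
    out = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in x]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * value[j] for w, value in zip(weights, x))
                    for j in range(len(query))])
    return out


def feed_forward(x, w1, w2):
    """Two linear maps with a ReLU between them, applied to each position."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def transformer_block(x, w1, w2):
    """Sublayer, then Add & Norm, twice."""
    x = layer_norm(add(x, self_attention(x)))
    x = layer_norm(add(x, feed_forward(x, w1, w2)))
    return x


def transformer(x, layers):
    """Stack the same block shape N times, each with its own weights."""
    for w1, w2 in layers:
        x = transformer_block(x, w1, w2)
    return x


rng = random.Random(0)
d_model, d_ff, n_layers = 4, 8, 2
x = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(3)]
layers = [
    ([[rng.uniform(-1.0, 1.0) for _ in range(d_ff)] for _ in range(d_model)],
     [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(d_ff)])
    for _ in range(n_layers)
]
y = transformer(x, layers)
print(len(y), len(y[0]))  # the stack preserves the input shape
```

Because every block maps a sequence of vectors to a sequence of the same shape, blocks can be stacked to any depth without changing the interface between them.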

The form has barely changed since 2017. What changed is everything around it: the amount of data, the size of the models, and the methods used to train them. The architecture was the unlock; scale was the payoff.

Where it falls short

Self-attention is O(n²) in sequence length. Because every token attends to every other token, the attention matrix has n × n entries for a sequence of n tokens. Double the context and you quadruple the attention compute and the memory needed to hold that matrix. This is the core reason long-context models are expensive, and reducing it is an active research frontier.
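The arithmetic is easy to check. The sketch below counts the entries in one head's attention matrix; the 4 bytes per score is an assumption (32-bit floats), and the figures ignore optimizations that avoid materializing the full matrix.

```python
def attention_matrix_entries(seq_len):
    """Number of query-key scores in one head's attention matrix."""
    return seq_len * seq_len


def attention_matrix_bytes(seq_len, bytes_per_score=4):
    """Memory for one such matrix, assuming 4-byte scores by default."""
    return attention_matrix_entries(seq_len) * bytes_per_score


for n in (1024, 2048, 4096):
    entries = attention_matrix_entries(n)
    mib = attention_matrix_bytes(n) / (1024 * 1024)
    print(f"n={n}: {entries} scores, {mib:.0f} MiB per head per layer")

# Doubling the sequence length quadruples the number of scores.
ratio = attention_matrix_entries(2048) / attention_matrix_entries(1024)
print(ratio)
```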

Position has to be bolted on. Attention is order-blind: it sees a set, not a sequence. Positional encodings are an artificial signal added to the embeddings to recover word order, not something the mechanism understands on its own.
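One common way to supply that signal is the fixed sinusoidal scheme from the original paper, sketched below. It is only one option (learned position embeddings are another), and the function names here are illustrative rather than taken from any library. The sketch assumes an even `d_model`.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each sin/cos pair uses its own wavelength, set by the dimension index.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


def add_positions(embeddings):
    """Add the position signal to each token embedding at the input."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(row, sinusoidal_encoding(pos, d_model))]
        for pos, row in enumerate(embeddings)
    ]


# The same token embedding at two positions becomes two different vectors,
# which is what lets order-blind attention tell the positions apart.
same_token_twice = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
with_positions = add_positions(same_token_twice)
print(with_positions[0] != with_positions[1])
```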

A block on its own does nothing useful. One Transformer block is just attention plus a feed-forward step. The power comes only from stacking it deep and training it at massive scale — far beyond what a single machine or a corpus like this one can provide.

The architecture says nothing about training. "Attention Is All You Need" trained its model for machine translation, but the architecture it introduced is only a shape. The wiring alone does not determine how a model is taught, what objective it optimizes, or how it is aligned to be useful and safe. Those questions, not the wiring, are what the years since 2017 have really been about.