The Attention Mechanism

Every technique before this one read words through a fixed window — a Markov chain sees the last few tokens, TF-IDF sees an unordered bag, a co-occurrence vector sees a ±3 neighbourhood. Attention removes the window. It lets each token look directly at every other token in the sequence and decide, for itself, which ones are relevant and by how much.

For each token, attention asks one question: given my query, which other tokens' keys match it, and how should I blend their values? A dot product scores every query against every key; dividing by √d, where d is the dimension of the key vectors, keeps those scores at a sane scale so the softmax does not saturate; a softmax turns each row of scores into weights that sum to 1; and those weights average the value vectors into a new, context-aware representation for the token. In matrix form: Attention(Q, K, V) = softmax(QKᵀ / √d) V.
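The four steps in that paragraph (score, scale, softmax, blend) can be sketched in a few lines. This is a minimal illustration in plain Python lists, chosen to avoid dependencies; a real implementation would use an array library. The function names and the three toy vectors are assumptions made for the example, not part of the demo.

```python
import math

def softmax(row):
    """Turn a list of scores into positive weights that sum to 1."""
    top = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(K[0])  # key dimension, used for the 1/sqrt(d) scaling
    weights = []
    for q in Q:
        # Score this query against every key, then scale.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights.append(softmax(scores))
    # Each output vector is the weighted average of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights

# Hypothetical 3-token, 2-dimensional example with Q = K = V.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one query's distribution over the three keys. The first token scores the second token at 0 and the other two equally, so its weights on the first and third tokens come out identical.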

The mechanism arrived in two steps. Bahdanau et al. (2014) introduced it as cross-attention: a translation decoder learning to look back at relevant encoder states instead of squeezing the whole source sentence into one fixed vector. Vaswani et al. (2017) then showed attention alone — no recurrence needed — was enough, generalising it to self-attention: every token in one sequence attending to every other token in that same sequence. What follows builds self-attention, the 2017 mechanism; the two differ in what gets attended to, not in the Q/K/V mechanic itself.
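To make the self-attention versus cross-attention distinction concrete, here is a small sketch in which the same function serves both cases and only the source of the queries, keys and values changes. It uses dot-product scoring for both; Bahdanau et al. originally scored matches with a small additive network, so treat this as an illustration of the 2017 formulation. The names `attend`, `encoder_states` and `decoder_states` and all the numbers are hypothetical.

```python
import math

def attend(queries, keys, values):
    """Return one blended value vector per query (scaled dot-product attention)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical toy states: 3 source-side vectors, 2 target-side vectors.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_states = [[1.0, 1.0], [0.0, 2.0]]

# Self-attention: queries, keys and values all come from the same sequence.
self_out = attend(encoder_states, encoder_states, encoder_states)

# Cross-attention: queries come from the decoder, keys and values from the encoder.
cross_out = attend(decoder_states, encoder_states, encoder_states)

print(len(self_out), len(cross_out))  # one output vector per query
```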

The heatmap below is the real attention-weight matrix for the phrase "thy love is as fair" over Shakespeare's sonnets. To keep the demo interpretable this is a deliberately simplified, single-head attention with Q = K = V = the token embeddings — no learned projection matrices. Each embedding is a genuine ±3-word co-occurrence vector over the corpus (L2-normalised), so words that keep each other's company in the sonnets attend to one another. It is fully deterministic.
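The setup described above can be approximated as follows. This is a sketch under stated assumptions, not the demo's actual code: the corpus is a short made-up stand-in for the sonnets, the `WINDOW` constant and function names are hypothetical, and the demo's exact score scaling is not specified here, so the numbers it prints will not match the heatmap.

```python
import math
from collections import Counter

# Hypothetical stand-in corpus; the real demo uses the full text of the sonnets.
corpus = "thy love is as fair as thy love is true and fair is thy love".split()
vocab = sorted(set(corpus))
WINDOW = 3  # count neighbours up to 3 positions away on each side

def cooccurrence_vector(word):
    """L2-normalised counts of the words seen within +/-WINDOW of `word`."""
    counts = Counter()
    for i, tok in enumerate(corpus):
        if tok != word:
            continue
        lo, hi = max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                counts[corpus[j]] += 1
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

phrase = ["thy", "love", "is", "as", "fair"]
E = [cooccurrence_vector(w) for w in phrase]  # Q = K = V = E, no projections
d = len(vocab)
weights = [
    softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in E])
    for q in E
]
for word, row in zip(phrase, weights):
    print(f"{word:>5}", [round(w, 3) for w in row])
```

Because nothing here is random or learned, running it twice gives the same matrix, which is the sense in which the demo is deterministic.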

query ↓ / key →    thy      love     is       as       fair
thy                0.718    0.112    0.068    0.038    0.064
love               0.104    0.665    0.101    0.071    0.060
is                 0.066    0.105    0.695    0.057    0.077
as                 0.041    0.081    0.062    0.763    0.053
fair               0.065    0.066    0.082    0.051    0.735

Colour scale: 0.0 to 1.0 attention weight.

Each row is one query token deciding where to look; the row sums to 1, up to rounding. The bright diagonal is each token attending to itself: the embeddings are L2-normalised, so a vector's dot product with itself is 1, the highest score it can give any key. The off-diagonal cells are the real routing: thy and love attend to each other (0.112 / 0.104), is reaches for love (0.105), and fair leans on is (0.082), all pairs that genuinely sit close together in the sonnets.
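The reading of the matrix given above can be checked mechanically. The short script below transcribes the heatmap values, confirms each row sums to 1 within display rounding, and reports the strongest off-diagonal key for every query. The variable names are illustrative.

```python
tokens = ["thy", "love", "is", "as", "fair"]
# Attention weights transcribed from the heatmap (rows = queries, columns = keys).
W = [
    [0.718, 0.112, 0.068, 0.038, 0.064],
    [0.104, 0.665, 0.101, 0.071, 0.060],
    [0.066, 0.105, 0.695, 0.057, 0.077],
    [0.041, 0.081, 0.062, 0.763, 0.053],
    [0.065, 0.066, 0.082, 0.051, 0.735],
]

strongest = {}
for i, row in enumerate(W):
    # Rows sum to 1 up to the three-decimal rounding of the displayed values.
    assert abs(sum(row) - 1.0) < 0.005
    # Ignore the diagonal and pick the key this query leans on most.
    j = max((k for k in range(len(row)) if k != i), key=lambda k: row[k])
    strongest[tokens[i]] = (tokens[j], row[j])
    print(f"{tokens[i]:>5} -> {tokens[j]:<5} {row[j]:.3f}")
```

It reports thy -> love, love -> thy, is -> love, as -> love and fair -> is, consistent with the pairs named in the text.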

The final step blends each value vector by these weights, producing one new vector per token that is informed by the whole phrase. That blended vector — not the original embedding — is what a real Transformer passes up to its next layer. Real attention adds learned Q/K/V projection matrices and several heads in parallel; here you see the routing mechanic on its own.
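As a worked instance of that blending step, the sketch below takes the weights from the thy row of the heatmap and applies them to five value vectors. The value vectors are hypothetical 2-dimensional stand-ins chosen for readability; the demo's real values are its high-dimensional co-occurrence embeddings.

```python
# Attention weights for the query "thy", taken from the first row of the heatmap.
weights = [0.718, 0.112, 0.068, 0.038, 0.064]

# Hypothetical 2-dimensional value vectors for thy, love, is, as, fair.
values = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
    [0.2, 0.8],
    [0.9, 0.1],
]

# New representation for "thy": each value vector scaled by its weight, then summed.
blended = [
    sum(w * v[j] for w, v in zip(weights, values))
    for j in range(len(values[0]))
]
print([round(x, 4) for x in blended])
```

The result stays close to thy's own value vector because 0.718 of the weight sits on the diagonal, with the remainder pulling it slightly toward its neighbours.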

Where it falls short

It is O(n²) in sequence length. Every token attends to every other, so the weight matrix is n × n. Double the sequence and you quadruple the compute and memory for the attention step. This quadratic cost is a central engineering constraint of standard Transformers, and a major reason long-context models are hard to build.
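A quick back-of-the-envelope calculation shows the growth. The sketch assumes 4 bytes per weight (float32) and counts a single head in a single layer; the function name and sequence lengths are illustrative.

```python
def attention_matrix_cells(n_tokens):
    """Number of query-key weights in one head's n x n attention matrix."""
    return n_tokens * n_tokens

for n in (1_000, 2_000, 4_000, 8_000):
    cells = attention_matrix_cells(n)
    # Assumes 4 bytes per weight (float32), one head, one layer.
    megabytes = cells * 4 / 1e6
    print(f"{n:>5} tokens -> {cells:>11,} weights, about {megabytes:,.0f} MB")
```

Each doubling of the token count multiplies the number of weights by four, so going from 1,000 to 8,000 tokens costs 64 times as much.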

Position is not inherent. Attention sees a set, not a sequence — shuffle the tokens and the weight between any given pair is unchanged. Real Transformers must add positional encodings to the embeddings precisely because attention alone is order-blind.
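The order-blindness is easy to verify. In the sketch below, three hypothetical embeddings are fed in two different orders, and the weight from token a to token b is read out of each result. No positional encoding is added, which is the point of the experiment.

```python
import math

def attention_weights(X):
    """Self-attention weights with Q = K = X (no positional information)."""
    d = len(X[0])
    rows = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Hypothetical embeddings for three tokens a, b, c.
a, b, c = [1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.3, 0.3, 1.0]

original = attention_weights([a, b, c])   # order: a b c
shuffled = attention_weights([c, a, b])   # order: c a b

# Weight from a to b: row 0, column 1 originally; row 1, column 2 after the shuffle.
print(original[0][1], shuffled[1][2])
```

The two printed values agree up to floating-point rounding: shuffling only permutes the rows and columns of the weight matrix, it does not change any individual pair's weight.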

This demo shows only routing. With Q = K = V, no learned projections, and a single head, you see which tokens talk to which — not the representational power that training provides. Real attention learns what to look for; here the "what" is fixed to raw co-occurrence similarity.
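For contrast, here is roughly what one head looks like once projection matrices are added. This is a sketch only: the matrices are filled with random numbers as a stand-in for trained parameters, and `d_model`, `d_head`, `W_q`, `W_k`, `W_v` and the toy embeddings are assumptions made for the example.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained "weights" are reproducible

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols):
    # Stand-in for a trained projection matrix; real values come from training.
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

d_model, d_head = 4, 2
X = [[1.0, 0.0, 0.0, 0.5],   # hypothetical embeddings for three tokens
     [0.0, 1.0, 0.0, 0.5],
     [0.0, 0.0, 1.0, 0.5]]

# Three separate projections, so queries, keys and values are no longer identical.
W_q, W_k, W_v = (random_matrix(d_model, d_head) for _ in range(3))
Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)

weights = [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head) for k in K])
           for q in Q]
output = matmul(weights, V)  # one d_head-dimensional vector per token
```

With separate projections the weight matrix is no longer forced to be brightest on the diagonal, since a token's query and its own key are now different vectors. Several heads would repeat this with their own matrices and concatenate the outputs.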

One layer is just a weighted average. Usefulness comes from stacking dozens of attention + feed-forward layers and training them on enormous corpora — far beyond a laptop demo. A single attention step is necessary but nowhere near sufficient.
