2022 · Alignment

Alignment: Making Models Do What You Ask

The base model already existed. Alignment is what put it in front of the public.

Concept page. Unlike the rest of this series, there is no training script to run here. This era is defined by scale far beyond a laptop: training large language models and building the human-feedback pipelines around them. This page explains the ideas; it does not ship a runnable pipeline.

How it works

A raw pretrained model is, at heart, a text predictor: it was trained to guess the next token over a huge pile of text, so it produces whatever continuation is most plausible. Ask it “How do I make bread?” and it may well answer with more questions — “How do I make pasta? How do I make rice?” — because a list of similar questions is exactly the kind of text that tends to follow on the open web. The model is not broken; it is doing precisely what it was trained to do. It just was never told that, when given a question, the wanted continuation is an answer.
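The "most plausible continuation" behaviour can be illustrated with a deliberately tiny stand-in for a language model. The sketch below is a bigram counter, not a neural network, and the corpus, function names, and whitespace tokenization are all invented for illustration. It only shows that a predictor trained on lists of questions continues a question with another question.

```python
from collections import Counter, defaultdict

# Toy corpus (invented): a run of similar questions, as on a forum index page.
CORPUS = (
    "how do i make bread ? how do i make pasta ? "
    "how do i make pasta ? how do i make rice ?"
).split()

def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def continue_text(counts, prompt, max_new_tokens=6):
    """Greedily append the most frequent next token, one step at a time."""
    out = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last token was never followed by anything in the corpus
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

counts = train_bigram(CORPUS)
print(continue_text(counts, "how do i make bread ?"))
# prints: how do i make bread ? how do i make pasta ?
```

Nothing in this predictor is wrong in the sense of a bug. It reproduces the statistics it was given, which is the same point the paragraph above makes about a raw pretrained model.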

The first fix is instruction tuning: fine-tune the model on many tasks phrased as instructions paired with good responses — summarize this, translate that, answer this question. This teaches the model the shape of following an instruction rather than merely continuing text, turning a predictor into something that at least tries to do what it is asked.
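As a rough sketch of what one instruction-tuning example might look like as training data, the code below joins an instruction and a response into a single sequence and scores only the response tokens. The "Instruction: ... Response:" template, the whitespace tokenizer, and the choice to mask the instruction out of the loss are assumptions made for illustration; real pipelines differ in all three.

```python
import math

def build_example(instruction, response):
    """Join instruction and response into one token list plus a loss mask.

    The mask is 0 over the instruction tokens and 1 over the response tokens,
    so only the response contributes to the training loss (an assumed design choice).
    """
    prompt_tokens = f"Instruction: {instruction} Response:".split()
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, mask

def masked_nll(token_probs, mask):
    """Average negative log-likelihood over the masked-in (response) positions.

    token_probs[i] is the probability a hypothetical model gave to the correct token i.
    """
    losses = [-math.log(p) for p, m in zip(token_probs, mask) if m == 1]
    return sum(losses) / len(losses)

tokens, mask = build_example("Translate 'hello' to French.", "bonjour")
print(tokens)
print(mask)
```

Under this setup the model is never penalized for failing to predict the instruction itself, only for failing to produce the paired response, which is one way to express "learn to respond rather than continue".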

The second fix sharpens quality and behaviour. In Reinforcement Learning from Human Feedback (RLHF), you show people two model outputs for the same prompt and ask which they prefer. Those preference comparisons train a separate reward model that scores responses, and the language model (the policy) is then optimized to produce outputs the reward model rates highly. This pipeline is what turned GPT-3 into InstructGPT and then ChatGPT.

Anthropic's Constitutional AI replaces much of that human labeling with model self-critique: the model revises its own answers against an explicit written set of principles (a “constitution”), and those self-judged comparisons supply the preference signal.

The key insight of this whole era: the leap that put LLMs in front of the public was not a capability breakthrough. The base model could already do the task. It was alignment: making the model helpful, honest, and willing to follow instructions.
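The step from preference comparisons to a reward model can be sketched with a small pairwise model. This is a toy under stated assumptions: the two hand-made features, the linear reward, and the learning rate are invented, and the final step simply picks the highest-scoring candidate instead of updating a policy's weights with a reinforcement learning algorithm as a real RLHF pipeline would.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, features):
    """Linear reward model: a weighted sum of response features."""
    return sum(wi * xi for wi, xi in zip(w, features))

def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so preferred responses score higher than rejected ones.

    Minimizes the pairwise loss -log(sigmoid(r(chosen) - r(rejected))).
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # Gradient descent step: d(loss)/dw = -(1 - sigmoid(margin)) * (chosen - rejected)
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                w[i] += lr * scale * (chosen[i] - rejected[i])
    return w

# Hypothetical features per response: [answers_the_question, lists_more_questions]
# Each pair is (response the labeler preferred, response the labeler rejected).
PAIRS = [
    ([1.0, 0.0], [0.0, 1.0]),  # an answer beat a list of further questions
    ([1.0, 0.0], [0.0, 0.0]),  # an answer beat an empty reply
    ([1.0, 1.0], [0.0, 1.0]),  # an answer plus follow-ups beat follow-ups alone
]

w = train_reward_model(PAIRS, n_features=2)

# Stand-in for policy optimization: choose the candidate the reward model prefers.
candidates = {"answer": [1.0, 0.0], "more questions": [0.0, 1.0]}
best = max(candidates, key=lambda name: reward(w, candidates[name]))
print(best)  # prints: answer
```

The reward model never sees a definition of "good"; it only sees which of two outputs was preferred, and everything it learns is inferred from those comparisons.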

STAGE 1
Pretrained predictor
Next-token model trained on raw text. Continues plausibly; doesn't reliably answer.

STAGE 2
Instruction tuning
Fine-tune on many tasks phrased as instruction → good response. Learns the shape of doing what it's asked.

STAGE 3
RLHF / Constitutional AI
Collect preferences over outputs → train a reward model → optimize the policy against it.
Output A vs. output B → a human picks the better one (or the model self-critiques against written principles) → reward model → optimize the policy to score higher.

STAGE 4
Aligned assistant
Helpful, honest, instruction-following. The version the public actually meets.

Raw model
> How do I make bread?
How do I make pasta? How do I make rice? How do I make pizza dough at home?…

Aligned model
> How do I make bread?
Mix flour, water, yeast, and salt; knead, let it rise, shape, then bake at high heat. Want a step-by-step recipe?

Same prompt, two models. The raw predictor continues with the most plausible text — a list of similar questions. The aligned model has been taught that a question calls for an answer. Nothing about the underlying knowledge changed; only what the model was optimized to do with it.

Where it falls short

Human preferences are noisy and expensive. Labelers disagree, get tired, and judge inconsistently, and collecting comparisons at scale costs real time and money. Whatever biases the labelers bring — cultural, linguistic, political — get baked into the reward model and, through it, into the assistant.
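One way to make "labelers disagree" concrete is to measure it. The sketch below uses invented votes from three hypothetical labelers and computes a simple pairwise agreement rate, then collapses each comparison to a majority label. Both functions are illustrative; real projects may use other agreement statistics.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: for each comparison, which output ("A" or "B")
# each of three labelers preferred.
VOTES = [
    ["A", "A", "A"],
    ["A", "B", "A"],
    ["B", "A", "A"],
    ["B", "B", "A"],
]

def pairwise_agreement(votes):
    """Fraction of labeler pairs, across all comparisons, that chose the same output."""
    agree = total = 0
    for row in votes:
        for a, b in combinations(row, 2):
            agree += a == b
            total += 1
    return agree / total

def majority_labels(votes):
    """Collapse each comparison to its majority vote, discarding the disagreement."""
    return [Counter(row).most_common(1)[0][0] for row in votes]

print(pairwise_agreement(VOTES))  # prints: 0.5
print(majority_labels(VOTES))     # prints: ['A', 'A', 'A', 'B']
```

In this invented data, three of the four majority labels hide a split vote. A reward model trained on the majority labels alone would treat those comparisons as settled when the labelers themselves were divided.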

Optimizing a reward model invites gaming it. The reward model is a proxy for “good response,” not the real thing, so pushing hard on it produces reward hacking: outputs that score well without being good. A common failure mode is sycophancy — the model learns that agreeing with the user and telling people what they want to hear earns higher ratings than being correct.
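The gap between the proxy and the real goal can be shown with two candidate replies and two scoring functions. Everything here is invented for illustration: the candidates, the hand-assigned correctness values, and the proxy weights, which stand in for a reward model that absorbed a labeler preference for agreeable replies.

```python
# Hypothetical replies to a user who has just asserted something false.
CANDIDATES = [
    {"text": "You're absolutely right!", "correct": 0.0, "agrees_with_user": 1.0},
    {"text": "That's not quite right; here is why.", "correct": 1.0, "agrees_with_user": 0.0},
]

def true_quality(candidate):
    """What we actually want to reward: correctness."""
    return candidate["correct"]

def proxy_reward(candidate):
    """A flawed learned reward (weights are invented): agreement with the user
    ended up counting for more than correctness."""
    return 0.8 * candidate["correct"] + 1.0 * candidate["agrees_with_user"]

picked_by_proxy = max(CANDIDATES, key=proxy_reward)
picked_by_truth = max(CANDIDATES, key=true_quality)

print(picked_by_proxy["text"])  # prints: You're absolutely right!
print(picked_by_truth["text"])  # prints: That's not quite right; here is why.
```

The harder a policy is optimized against the proxy, the more reliably it finds replies like the first one, which is the sycophancy failure described above.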

The goals themselves are contested. “Helpful, honest, and harmless” sound simple but pull against each other in practice — the most helpful answer is sometimes not the most harmless one, and honesty can be unwelcome. There is no single objective everyone agrees on, and writing principles down (as in Constitutional AI) makes the choices explicit but does not make them uncontroversial.

It is unfinished, not a checkbox. Alignment is an ongoing, unsolved research problem. Each method narrows the gap between what we ask for and what we get; none of them closes it.