2022 · Alignment
The base model already existed. Alignment is what put it in front of the public.
A raw pretrained model is, at heart, a text predictor: it was trained to guess the next token over a huge pile of text, so it produces whatever continuation is most plausible. Ask it “How do I make bread?” and it may well answer with more questions — “How do I make pasta? How do I make rice?” — because a list of similar questions is exactly the kind of text that tends to follow on the open web. The model is not broken; it is doing precisely what it was trained to do. It just was never told that, when given a question, the wanted continuation is an answer.
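As a rough illustration of that behaviour, here is a toy next-word predictor. It is only a sketch: a bigram counter over a made-up four-question corpus stands in for a real pretrained model, and the names `corpus`, `train_bigrams`, and `continue_text` are invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for web text: a page listing similar questions.
corpus = (
    "How do I make bread? How do I make pasta? "
    "How do I make rice? How do I make soup?"
)


def train_bigrams(text):
    """Count, for every word, which words follow it in the text."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def continue_text(counts, prompt, max_words=5):
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.split()
    start = len(words)
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:  # no observed continuation for this word, so stop
            break
        words.append(max(followers, key=followers.get))
    return words[start:]


counts = train_bigrams(corpus)
continuation = continue_text(counts, "How do I make bread?")
print(" ".join(continuation))
```

The predictor has only ever seen questions followed by more questions, so another question is what it produces. Nothing in its training signal says that a question should be followed by an answer.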
The first fix is instruction tuning: fine-tune the model on many tasks phrased as instructions paired with good responses — summarize this, translate that, answer this question. This teaches the model the shape of following an instruction rather than merely continuing text, turning a predictor into something that at least tries to do what it is asked.
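One way to picture the training data is as instruction/response pairs rendered into a single sequence, with the loss computed only on the response. The sketch below assumes a made-up template (`### Instruction:` / `### Response:`) and whitespace tokenization; real pipelines use their own templates and subword tokenizers, and `build_example` is a hypothetical helper.

```python
# Hypothetical instruction/response pairs; real datasets are far larger.
examples = [
    {"instruction": "Translate to French: Good morning.",
     "response": "Bonjour."},
    {"instruction": "How do I make bread?",
     "response": "Mix flour, water, yeast and salt, then knead, proof and bake."},
]

# Assumed template, chosen for illustration only.
TEMPLATE = "### Instruction:\n{instruction}\n### Response:\n"


def build_example(pair):
    """Return (tokens, loss_mask) for one supervised fine-tuning example.

    loss_mask[i] is 1 where the token belongs to the response, so training
    rewards producing the answer rather than reproducing the prompt.
    """
    prompt_tokens = TEMPLATE.format(instruction=pair["instruction"]).split()
    response_tokens = pair["response"].split()
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


for pair in examples:
    tokens, loss_mask = build_example(pair)
    print(sum(loss_mask), "of", len(tokens), "tokens contribute to the loss")
```

Masking the prompt tokens is a common design choice rather than a requirement; it keeps the training signal focused on the response.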
The second fix sharpens quality and behaviour. In Reinforcement Learning from Human Feedback (RLHF), you show people two model outputs for the same prompt and ask which they prefer. Those preference comparisons train a separate reward model that scores responses, and the language model (the policy) is then optimized to produce outputs the reward model rates highly. This pipeline is what turned GPT-3 into InstructGPT, and the same approach applied to a GPT-3.5 series model produced ChatGPT.
Anthropic's Constitutional AI replaces much of that human labeling with AI feedback: the model critiques and revises its own answers against an explicit written set of principles (a “constitution”), and model-judged comparisons of which response better follows those principles supply the preference signal.
The key insight of this whole era: the leap that put LLMs in front of the public was not a capability breakthrough. The base model could already do the task. What changed was alignment: making the model helpful, honest, and willing to follow instructions.
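The reward-model step can be sketched with the pairwise (Bradley-Terry style) loss commonly used for preference comparisons: the loss is small when the preferred response scores higher than the rejected one. The scores below are made-up numbers standing in for a reward model's outputs, and `preference_loss` is an illustrative name, not a function from any library.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one.

    Models P(chosen preferred) = sigmoid(reward_chosen - reward_rejected).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) simplifies to log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical reward-model scores for (preferred, rejected) response pairs.
comparisons = [(2.0, -1.0), (0.5, 0.5), (-1.0, 1.5)]

for chosen, rejected in comparisons:
    loss = preference_loss(chosen, rejected)
    print(f"chosen={chosen:+.1f} rejected={rejected:+.1f} loss={loss:.3f}")
```

Minimizing this loss over many human comparisons pushes the reward model to score preferred responses higher. The policy is then optimized against those scores.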
Same prompt, two models. The raw predictor continues with the most plausible text — a list of similar questions. The aligned model has been taught that a question calls for an answer. Nothing about the underlying knowledge changed; only what the model was optimized to do with it.
Human preferences are noisy and expensive. Labelers disagree, get tired, and judge inconsistently, and collecting comparisons at scale costs real time and money. Whatever biases the labelers bring — cultural, linguistic, political — get baked into the reward model and, through it, into the assistant.
Optimizing a reward model invites gaming it. The reward model is a proxy for “good response,” not the real thing, so pushing hard on it produces reward hacking: outputs that score well without being good. A common failure mode is sycophancy — the model learns that agreeing with the user and telling people what they want to hear earns higher ratings than being correct.
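A toy illustration of that gap, under loudly stated assumptions: the proxy reward below is hand-written to mimic raters who favour agreement, and the candidate replies, weights, and function names are all invented for the example. Picking the highest-scoring candidate then selects the agreeable reply rather than the correct one.

```python
# Hypothetical candidate replies to a user who has asserted something false.
candidates = [
    {"text": "You're right, great point!",
     "agrees": True, "correct": False},
    {"text": "Actually, that is not accurate; here is why.",
     "agrees": False, "correct": True},
    {"text": "I am not sure.",
     "agrees": False, "correct": False},
]


def proxy_reward(reply):
    """Stand-in for a learned reward model that absorbed a rater bias.

    The weights are assumptions: agreement is rewarded more than correctness.
    """
    return 1.0 * reply["agrees"] + 0.6 * reply["correct"]


def true_quality(reply):
    """What we actually want: a correct answer."""
    return 1.0 if reply["correct"] else 0.0


picked_by_proxy = max(candidates, key=proxy_reward)
picked_by_truth = max(candidates, key=true_quality)

print("proxy picks:", picked_by_proxy["text"])
print("truth picks:", picked_by_truth["text"])
```

The proxy and the true objective agree on which reply is worst, but they disagree at the top, which is exactly where optimization pressure is applied.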
The goals themselves are contested. “Helpful, honest, and harmless” sound simple but pull against each other in practice — the most helpful answer is sometimes not the most harmless one, and honesty can be unwelcome. There is no single objective everyone agrees on, and writing principles down (as in Constitutional AI) makes the choices explicit but does not make them uncontroversial.
It is unfinished, not a checkbox. Alignment is an ongoing, unsolved research problem. Each method narrows the gap between what we ask for and what we get; none of them closes it.