2018 · Pretraining & Transfer Learning
Train one giant model on raw text once — then reuse it for everything.
Until 2018 the standard recipe was: pick a task, gather labeled examples for it, and train a fresh model from scratch. Pretraining inverted that. You train one large Transformer on a generic self-supervised objective over an enormous pile of unlabeled text, the kind you can scrape from the internet for free. The model never sees a human-written label; it learns by predicting parts of the text from other parts. Because no labels are needed, the training data is limited only by how much text you can collect rather than by how much annotation you can pay for, and that is what unlocked internet-scale training.
The expensive part now happens exactly once. After pretraining, adapting the model to a real task — sentiment, question answering, named-entity recognition — needs only a small amount of labeled data and a short round of fine-tuning that nudges the existing weights. The general language knowledge is already inside the model; fine-tuning just points it at a specific job. One costly run, then many cheap specializations.
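The "one costly run, many cheap specializations" pattern can be sketched with a toy model. This is only an illustration under loud assumptions: a three-weight logistic model stands in for the Transformer, and the `pretrained` vector, the `fine_tune` and `predict` helpers, and both tiny labeled sets are hypothetical names and made-up data, not anything from BERT or GPT.

```python
import math

def predict(w, x):
    """Probability of label 1 under a logistic model with weights w."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def fine_tune(pretrained, examples, lr=0.1, steps=50):
    """Start from a copy of the pretrained weights and take a few gradient steps."""
    w = list(pretrained)  # copy, so the shared pretrained weights stay untouched
    for _ in range(steps):
        for x, y in examples:
            p = predict(w, x)
            # For the log loss of a logistic model, the gradient is (p - y) * x.
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical "pretrained" weights; in reality these come from the costly run.
pretrained = [0.5, -0.3, 0.8]

# Two small, made-up labeled sets standing in for two downstream tasks.
task_a = [([1.0, 0.0, 1.0], 1), ([0.0, 1.0, 0.0], 0)]
task_b = [([1.0, 1.0, 0.0], 0), ([0.0, 0.0, 1.0], 1)]

# The same starting point is reused for each task; only the short
# fine-tuning loop is repeated.
model_a = fine_tune(pretrained, task_a)
model_b = fine_tune(pretrained, task_b)
```

Each call copies the pretrained weights before updating them, so the expensive artifact is never modified and every task gets its own cheap, specialized variant.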
Two flavors of the same idea appeared in the same year. BERT (Google) reads text bidirectionally: it masks a random subset of tokens (about 15% in the original setup) and learns to predict each from the context on both sides, which is ideal for understanding tasks where the whole sentence is available at once. GPT (OpenAI) reads strictly left to right, predicting the next token from everything before it, which is ideal for generation, where text is produced one token at a time. Different objectives, same breakthrough: learn from raw text first, specialize later.
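To make the two objectives concrete, here is a minimal sketch of how each one turns the same unlabeled sentence into training examples. The helper names `masked_lm_example` and `next_token_examples` are hypothetical, whitespace splitting stands in for a real tokenizer, and the masking is simplified: every selected token is replaced by `[MASK]`, whereas BERT's actual recipe sometimes substitutes a random token or leaves the original in place.

```python
import random

MASK = "[MASK]"  # placeholder token for a blanked-out position

def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """BERT-style: hide some tokens; the targets are the hidden originals."""
    rng = random.Random(seed)
    inputs = list(tokens)
    targets = {}  # position -> original token
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs[i] = MASK
            targets[i] = tok
    if not targets:  # short inputs: make sure at least one token is masked
        i = rng.randrange(len(tokens))
        inputs[i] = MASK
        targets[i] = tokens[i]
    return inputs, targets

def next_token_examples(tokens):
    """GPT-style: every prefix is a context, the following token is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()

# The model would see the full sentence with blanks, context on both sides.
masked_inputs, masked_targets = masked_lm_example(tokens)

# The model would see only what lies to the left of each target.
causal_pairs = next_token_examples(tokens)
```

In both cases the "labels" are pieces of the text itself, which is why no human annotation is needed.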
One self-supervised pass over a mountain of unlabeled text produces a single reusable model. BERT learns by filling masked blanks using both sides; GPT learns by predicting the next token from the left. That one model is then copied and fine-tuned with a small labeled set per task — the costly step is paid once, the payoff reused everywhere.
Pretraining is extraordinarily expensive. Training a large model on internet-scale text takes vast amounts of data, compute, and electricity; in practice only large, well-funded organizations can afford to do it. Everyone else is limited to downloading and fine-tuning what those few have already trained.
It inherits whatever is in the raw data. Because the model learns from uncurated internet text, it absorbs the biases, stereotypes, and factual errors of that text along with the useful patterns, and at scale those flaws are hard to find and harder to remove.
Fine-tuning is still per-task and can be brittle. Each new application still needs its own labeled dataset, and adapting the weights for one task can cause the model to catastrophically forget things it knew before. The adaptation step is cheaper than starting over, but it is neither free nor robust.
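Catastrophic forgetting can be reproduced even in a toy setting. The sketch below is an assumption-laden illustration, not a measurement on a real pretrained model: a two-weight logistic model is trained on a made-up task A, then trained only on a made-up task B. The `train` and `predict` helpers and both datasets are hypothetical. A single set of weights could satisfy both tasks, yet sequential training without revisiting task A drifts away from it.

```python
import math

def predict(w, x):
    """Probability of label 1 under a logistic model with weights w."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def train(w, examples, lr=0.5, steps=100):
    """Plain gradient descent on the log loss, starting from weights w."""
    w = list(w)
    for _ in range(steps):
        for x, y in examples:
            p = predict(w, x)
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w

# Made-up tasks that share the first feature but disagree on what it implies.
task_a = [([1.0, 0.2], 1)]
task_b = [([1.0, -0.2], 0)]

after_a = train([0.0, 0.0], task_a)  # learn task A
after_b = train(after_a, task_b)     # then adapt to task B only

a_input = task_a[0][0]
score_before = predict(after_a, a_input)  # confidence on A right after learning A
score_after = predict(after_b, a_input)   # confidence on A after training on B
```

Comparing `score_before` with `score_after` shows the model's answer on task A flipping once the weights have been pushed toward task B, which is the brittleness described above in miniature.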
It predicts plausible text — it does not do what you ask. A pretrained-then-fine-tuned model is fundamentally still a next-word (or fill-the-blank) predictor. It will produce text that looks right far more readily than text that reliably follows an instruction. Closing that gap is the job of a later era: alignment.