Scaling Laws & In-Context Learning

Concept: this page has no model you can train or run. Every other stop on this journey ships a script you can run on a laptop; this era is defined by scale far beyond one, with hundreds of billions of parameters trained on thousands of GPUs. So this is an explainer, not a demo.

How it works

In 2020, Kaplan et al. measured how a language model's loss — its error at predicting the next token — changes as you grow the three knobs you can buy: model size, dataset size, and compute. The striking result: loss falls as a smooth power law in each of them. Plot loss against compute on log-log axes and you get a near-straight line sloping steadily downward, holding across many orders of magnitude. That means performance is, to a startling degree, a predictable engineering function of scale rather than of clever architecture. You can forecast how good a not-yet-built model will be before spending a dollar training it.
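The forecasting idea can be sketched in a few lines. This is a minimal illustration, not Kaplan et al.'s fitting procedure: it fits a straight line to (log compute, log loss) for a handful of small runs and extrapolates to a much larger budget. The data is synthetic, generated from a made-up law with hypothetical constants `TRUE_A` and `TRUE_B`, and the sketch ignores any irreducible-loss floor that a real fit would need to account for.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares line through (log C, log L).

    Returns (a, b) such that L is approximately a * C ** (-b).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** (-b)

# Synthetic "small model" runs from a made-up law (illustrative constants only).
TRUE_A, TRUE_B = 10.0, 0.05
small_compute = [1e15, 1e16, 1e17, 1e18]
small_loss = [TRUE_A * c ** (-TRUE_B) for c in small_compute]

# Fit on the small runs, then forecast a run 100,000x larger than the biggest one.
a, b = fit_power_law(small_compute, small_loss)
big_compute = 1e23
forecast = predict_loss(a, b, big_compute)
print(f"fitted exponent: {b:.3f}, forecast loss at {big_compute:.0e}: {forecast:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the constants and the extrapolation is exact. Real measurements are noisy, so a real forecast carries error bars that widen with the extrapolation distance.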

Acting on this, OpenAI built GPT-3 with 175 billion parameters, and it revealed a behavior nobody trained for directly: in-context learning. Give the model a task description and a few worked examples inside the prompt, and it performs the brand-new task with no weight updates at all. Where earlier systems needed a labelled dataset and a round of fine-tuning per task, the scaled model could be steered at inference time, on the fly, from a handful of examples. This is an emergent capability: it was never an explicit training objective; it appeared as a side effect of scale.

The consequence reshaped the whole field: the prompt became the new programming interface. Instead of collecting data and retraining, you describe what you want in natural language, optionally show a few examples, and the same frozen model adapts. This is the period where language models stopped being narrow, single-purpose tools and became general-purpose.

Kaplan's 2020 recipe undersold data, though: it said to grow model size faster than dataset size. Hoffmann et al. (2022), the Chinchilla paper, redid the fit and found that the compute-optimal recipe uses far more data per parameter than Kaplan estimated, with model size and training tokens growing in roughly equal proportion. That correction is why later models, including the Llama 3 figure in the ladder below, train on trillions of tokens rather than the hundreds of billions that GPT-3-era scaling would have suggested.
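To make the correction concrete, here is a rough sketch that splits a compute budget the Chinchilla way. It leans on two rules of thumb that are not stated on this page and should be read as assumptions: training compute is approximated as 6 FLOPs per parameter per token, and the compute-optimal ratio is taken as about 20 tokens per parameter, a common summary of Hoffmann et al. The function name `compute_optimal` is hypothetical.

```python
import math

# Assumption: training compute C is roughly 6 * N * D FLOPs
# (N = parameters, D = training tokens).
FLOPS_PER_PARAM_TOKEN = 6
# Assumption: Chinchilla-style rule of thumb of about 20 tokens per parameter.
TOKENS_PER_PARAM = 20

def compute_optimal(compute_flops):
    """Split a FLOP budget into (parameters, tokens) under the two assumptions above."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# GPT-3 figures quoted on this page.
gpt3_params = 175e9
gpt3_tokens = 300e9
gpt3_ratio = gpt3_tokens / gpt3_params  # under 2 tokens per parameter

# Re-spend GPT-3's approximate budget with the 20-tokens-per-parameter split.
gpt3_compute = FLOPS_PER_PARAM_TOKEN * gpt3_params * gpt3_tokens
opt_params, opt_tokens = compute_optimal(gpt3_compute)

print(f"GPT-3: {gpt3_ratio:.1f} tokens per parameter")
print(f"same budget, rule-of-thumb split: {opt_params / 1e9:.0f}B parameters, "
      f"{opt_tokens / 1e12:.2f}T tokens")
```

Under these assumptions the same budget buys a model less than a third the size of GPT-3, trained on more than three times as many tokens.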

The defining image of the era: on log-log axes, test loss falls along a straight line as training compute grows. The dashed line is the power law fit on small models; the measured curve keeps tracking it as you scale up, so you can extrapolate the quality of a model you have not yet built. (Schematic — the relationship is real, the exact numbers are illustrative.)

sea → mer
sky → ciel
cat → ___ → chat
no weight updates — learned from the prompt alone

In-context learning: a few examples in the prompt are enough for the frozen model to infer the task (here, English→French) and continue it — the prompt as program.
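"The prompt as program" can be shown without any model at all. The sketch below only assembles the text a frozen model would be asked to continue. The helper name `build_few_shot_prompt`, the instruction wording, and the `=>` separator are illustrative choices, and the call that would send the prompt to a model is deliberately left out because no particular API is assumed here.

```python
def build_few_shot_prompt(task, examples, query, separator=" => "):
    """Assemble a few-shot prompt: a task description, worked examples, then an open query."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"{source}{separator}{target}")
    # The final line is left unfinished; the model's continuation is the answer.
    lines.append(f"{query}{separator.rstrip()}")
    return "\n".join(lines)

# The English-to-French examples from the figure above.
examples = [("sea", "mer"), ("sky", "ciel")]
prompt = build_few_shot_prompt("Translate English to French.", examples, "cat")
print(prompt)
```

Changing the task means changing the strings passed in, not the model: swap the examples and the same frozen weights are asked to do something else.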

The leap in scale

Every other stop on this journey runs on a laptop; the models of this era cannot. Compare the training data on a logarithmic axis, where each step to the right means ten times more:

Four tidy bars, but the axis is logarithmic, so it has to stretch nine powers of ten from this repo (17,608 words, trained in ~4 seconds) to a current open model like Llama 3 (~15 trillion tokens). That is about a billion times more data and roughly ten million times more parameters. The orderly bars actually hide the gap, which is exactly why the numbers are so easy to underestimate. (This repo's figure is a literal word count; the model figures are tokens, the subword units these models actually train on. Each token is roughly 0.75 words, a conversion already folded into the reading-time estimates below.)
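The "nine powers of ten" figure is easy to check by hand. A quick sketch using only the two data sizes quoted above; like the text, it compares a word count with a token count, so the result is an order-of-magnitude figure rather than a precise ratio.

```python
import math

repo_words = 17_608       # this repo, literal word count
llama3_tokens = 15e12     # Llama 3, approximate token count

# Mixed units (words vs tokens), as in the text: fine for orders of magnitude.
data_ratio = llama3_tokens / repo_words
powers_of_ten = math.log10(data_ratio)

print(f"ratio: {data_ratio:.2e}, i.e. about {powers_of_ten:.1f} powers of ten")
```

The ratio comes out a little under a billion, which is why the logarithmic axis needs roughly nine decades to fit both bars.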

So picture it instead as reading time. At a brisk 200 words a minute, non-stop, here is how long it would take a person to read each model's training text end to end (converting each model's token count to words at ~0.75 words/token):

this repo: 17,608 words, ~1.5 hours

GPT-3 · 2020: ~300 billion tokens, ~2,100 years

frontier · Llama 3: ~15 trillion tokens, ~107,000 years

~107,000 years of non-stop reading — roughly 20× all of recorded human history.
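The reading-time figures follow from the stated rates. A small sketch that reproduces them, assuming 200 words a minute and 0.75 words per token as given above, plus one assumption not on this page: a year of 365.25 days of non-stop reading.

```python
WORDS_PER_MINUTE = 200               # reading speed from the text
WORDS_PER_TOKEN = 0.75               # conversion from the text
MINUTES_PER_YEAR = 60 * 24 * 365.25  # assumption: non-stop reading, 365.25-day year

def reading_minutes(words):
    """Minutes needed to read the given number of words without stopping."""
    return words / WORDS_PER_MINUTE

repo_hours = reading_minutes(17_608) / 60
gpt3_years = reading_minutes(300e9 * WORDS_PER_TOKEN) / MINUTES_PER_YEAR
llama3_years = reading_minutes(15e12 * WORDS_PER_TOKEN) / MINUTES_PER_YEAR

print(f"this repo: about {repo_hours:.1f} hours")
print(f"GPT-3:     about {gpt3_years:,.0f} years")
print(f"Llama 3:   about {llama3_years:,.0f} years")
```

The results land within rounding of the figures above; a slightly different year length or reading speed shifts the last digits but not the order of magnitude.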

An afternoon versus deep prehistory. And reading is only the data: a frontier run also means tens of thousands of specialised chips, months of work by large research teams, and an energy and hardware bill estimated in the millions of dollars. Closed-model figures are undisclosed and some values here are approximate; the point is the order of magnitude, not the last digit. This is the undertaking the rest of this page sits on top of.

Where it falls short

Scaling laws describe loss, not virtue. A power law over next-token error says nothing about truthfulness, safety, or usefulness. A bigger model is a more capable predictor, not automatically a more honest, harmless, or helpful one — it can become more fluent at being confidently wrong.

In-context learning is brittle. The same task can succeed or fail depending on the exact wording, the order of the examples, or trivial formatting. Steering by prompt is powerful but unstable, with no guarantees and little transparency about why a phrasing works.
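One way to probe this brittleness is to generate every reordering and reformatting of the same few-shot prompt and compare a model's answers across them. The sketch below only builds the variants. The helper `prompt_variants`, the two separators, and the extra `dog`/`chien` pair are hypothetical choices for illustration, and scoring the variants against a model is left out because no particular API is assumed.

```python
from itertools import permutations

def prompt_variants(examples, query, separators=(" => ", ": ")):
    """Return one prompt per (example order, separator) combination."""
    variants = []
    for order in permutations(examples):
        for sep in separators:
            lines = [f"{source}{sep}{target}" for source, target in order]
            # Leave the query line open so a model would complete it.
            lines.append(f"{query}{sep.rstrip()}")
            variants.append("\n".join(lines))
    return variants

# Same task, three worked examples (the third pair is added for illustration).
examples = [("sea", "mer"), ("sky", "ciel"), ("dog", "chien")]
variants = prompt_variants(examples, "cat")

print(f"{len(variants)} variants of one task")
print(variants[0])
```

If the task were robustly learned from context, every variant would produce the same answer; counting how many do not is a simple measure of prompt sensitivity.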

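Scaling is astronomically expensive and bounded. The curve keeps falling, but under a power law each further fractional drop in loss requires multiplying compute, energy, and money by a large constant factor, so the bill grows exponentially with the number of steps down. High-quality training data is finite, and power budgets are real; you cannot brute-force forever.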
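The cost of each step down follows directly from the power-law form. A rough sketch, assuming loss behaves as L = a * C^(-b) with no irreducible floor; the exponent value is an assumption chosen to be of the order Kaplan et al. reported for compute, and the function name `compute_multiplier` is hypothetical.

```python
def compute_multiplier(loss_reduction, exponent):
    """Factor by which compute must grow to cut loss by the fraction `loss_reduction`.

    Assumes L = a * C ** (-exponent), so L2 / L1 = (C2 / C1) ** (-exponent).
    """
    if not 0 < loss_reduction < 1:
        raise ValueError("loss_reduction must be strictly between 0 and 1")
    return (1.0 / (1.0 - loss_reduction)) ** (1.0 / exponent)

# Assumption: a compute exponent of about 0.05 (illustrative, not a measured value here).
ASSUMED_EXPONENT = 0.05

ten_percent = compute_multiplier(0.10, ASSUMED_EXPONENT)
halving = compute_multiplier(0.50, ASSUMED_EXPONENT)

print(f"10% lower loss: about {ten_percent:.1f}x the compute")
print(f"50% lower loss: about {halving:,.0f}x the compute")
```

With an exponent this small, shaving a tenth off the loss costs several times the compute, and halving it costs about a million times more, which is the sense in which the curve is real but punishing.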

A raw scaled model is still an unaligned next-token predictor. Left as trained, GPT-scale models continue the most likely text, not the text you actually want. Capability and obedience are different things, and scale only delivered the first.