2024 – 2025 · Reasoning & Test-Time Compute
Spend more computation at answer time, not just at training time — and watch accuracy climb.
There is no index.js to run here. Unlike the earlier stops on this journey, this era is defined by scale far beyond a laptop: frontier models trained on enormous clusters, often with reinforcement learning. This page explains the idea; it does not ship a runnable demo.
The first crack in the old assumption, that an answer must come out of a single forward pass, came from prompting. Researchers found that simply asking a model to “think step by step” (chain-of-thought prompting) elicited reasoning ability that was already latent in the network. Instead of leaping straight to an answer, the model wrote out intermediate steps, and on multi-step problems those written steps substantially improved the accuracy of the final answer.
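As a rough illustration of how small the change is, the sketch below builds two prompts for the same question: one direct, one with a step-by-step cue appended. The helper name `build_prompt` and the exact wording of the cue are assumptions made for this example, and no model is called; the strings would be sent to whatever model API you use.

```python
def build_prompt(question: str, step_by_step: bool = False) -> str:
    """Return a plain-text prompt, optionally ending in a chain-of-thought cue.

    Hypothetical helper for illustration; it only formats strings.
    """
    prompt = f"Q: {question}\nA:"
    if step_by_step:
        # The cue asks the model to write intermediate steps before answering.
        prompt += " Let's think step by step."
    return prompt


question = (
    "A shelf holds 3 boxes. Each box has 4 jars. "
    "Each jar holds 6 marbles. How many marbles in total?"
)

direct_prompt = build_prompt(question)
cot_prompt = build_prompt(question, step_by_step=True)

print(direct_prompt)
print("---")
print(cot_prompt)
```

The two prompts differ only in the trailing cue, which is the point: the model and the question are unchanged.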
Reasoning models then made this the default behaviour rather than a prompting trick. They are trained — often with reinforcement learning — to produce long internal chains of intermediate steps before committing to an answer, and to spend more compute on harder problems. An easy question gets a short pass; a hard one gets a long, deliberate one.
The big idea is that this introduces a new scaling axis. It breaks the assumption that a model’s answer must come out of a single forward pass. When more computation is allocated at inference time, performance improves with thinking time, not only with model size. Two levers now exist: how big the model is, and how long you let it think.
Q: A shelf holds 3 boxes. Each box has 4 jars. Each jar holds 6 marbles. How many marbles in total? — The model pattern-matches to a single number and blurts an answer.
Same question, same model. Writing out boxes × jars first, then jars × marbles, keeps each step small enough to get right, and the steps compose into the correct total.
Top: the same model on the same problem. A single forward pass pattern-matches and misses; an explicit chain of intermediate steps composes small, correct moves into the right total. Bottom (schematic): accuracy tends to rise with thinking tokens spent at inference, then plateaus — a scaling axis independent of model size.
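The marble problem from the figure can be written out as the two small steps it decomposes into. This is only the arithmetic, not a model; the variable names are chosen for this sketch.

```python
boxes = 3
jars_per_box = 4
marbles_per_jar = 6

# Step 1: boxes x jars per box gives the total number of jars.
total_jars = boxes * jars_per_box

# Step 2: jars x marbles per jar gives the total number of marbles.
total_marbles = total_jars * marbles_per_jar

steps = [
    f"{boxes} boxes x {jars_per_box} jars per box = {total_jars} jars",
    f"{total_jars} jars x {marbles_per_jar} marbles per jar = {total_marbles} marbles",
]
for step in steps:
    print(step)
```

Each line is a single multiplication, and the second step reuses the result of the first, which is what "composing" the steps means here.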
Thinking is not free. Every extra step is more tokens generated, so longer reasoning means higher latency and more compute cost per query. There is a real tradeoff: you pay in time and money for each increment of accuracy.
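To make the tradeoff concrete, here is a back-of-the-envelope sketch. Every number in it (answer length, generation speed, price per thousand tokens) is a made-up placeholder rather than a figure from any real provider, and `estimate_query` is a hypothetical helper. It assumes latency and cost both grow linearly with generated tokens, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class QueryEstimate:
    tokens: int        # total generated tokens (thinking + answer)
    latency_s: float   # seconds spent generating
    cost_usd: float    # cost of the generated tokens


def estimate_query(
    thinking_tokens: int,
    answer_tokens: int = 50,           # placeholder answer length
    tokens_per_second: float = 50.0,   # placeholder generation speed
    usd_per_1k_tokens: float = 0.01,   # placeholder price
) -> QueryEstimate:
    """Estimate latency and cost, assuming both scale linearly with tokens."""
    total = thinking_tokens + answer_tokens
    return QueryEstimate(
        tokens=total,
        latency_s=total / tokens_per_second,
        cost_usd=total / 1000 * usd_per_1k_tokens,
    )


quick = estimate_query(thinking_tokens=0)
deliberate = estimate_query(thinking_tokens=2000)

for label, est in (("quick", quick), ("deliberate", deliberate)):
    print(f"{label}: {est.tokens} tokens, {est.latency_s:.1f} s, ${est.cost_usd:.4f}")
```

Under these placeholder numbers, the long reasoning pass costs and waits in proportion to its extra tokens; whether the accuracy gained is worth it depends on the task.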
The visible chain is not guaranteed to be the true process. The reasoning a model writes out is itself generated text. It can rationalize a wrong answer with a fluent, convincing-looking chain, so the steps you see are not a reliable audit of how the model actually reached its conclusion.
Longer chains can compound errors. A mistake early in a long chain propagates — later steps build confidently on a flawed premise, and more thinking sometimes drifts further from the right answer rather than toward it.
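One simple way to see why length can hurt is a toy model in which every step is correct with the same probability, independently of the others, and no step ever repairs an earlier one. Under that assumption a chain of n steps is entirely correct with probability p to the power n. Real models can notice and fix mistakes, so treat this as an illustration of the compounding effect, not a prediction; the function name and the 0.98 step accuracy are chosen for this sketch.

```python
def chain_success_probability(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain is correct.

    Toy model: steps are independent, share one accuracy, and errors are
    never corrected later in the chain.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


for n in (1, 5, 20, 50):
    p = chain_success_probability(0.98, n)
    print(f"{n:>2} steps at 98% per step: {p:.3f} chance the whole chain is right")
```

Even with a high per-step accuracy, the chance of an error-free chain falls steadily as the chain grows, which is why later steps can end up building on a flawed premise.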
It helps verifiable domains most. Math, code, and formal planning have checkable intermediate steps, so reasoning pays off there. Open-ended, subjective, or taste-driven questions benefit far less from grinding out more steps.
It is not a substitute for knowing a fact. No amount of step-by-step deliberation conjures information the model never learned. Reasoning rearranges and combines what is already there; a model cannot reason its way to a fact it does not have.