Tool Use & Agents — NLP Journey

Concept page: nothing to run

Every earlier stop on this journey ships a small program you can run on a laptop. This one does not. The defining feature of this era is scale (of data, compute, and model size) far beyond what a single machine can reproduce. So this page explains the idea rather than executing it.

How it works

A frontier model is still, at its core, a next-token predictor. What changed is what it is allowed to emit. Instead of only producing prose, the model can produce a tool call: a structured request to run code, search the web, query a database, or read and edit a file. The surrounding system executes that request and hands the result back to the model as more text to condition on. The model's vocabulary of actions is suddenly as wide as the software around it.
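To make the idea of a tool call concrete, here is a toy sketch of the "surrounding system" half of that exchange. It assumes the model emits a JSON string naming a tool and its arguments. The JSON shape, the `TOOLS` registry, and both tool functions are hypothetical names invented for illustration, not any particular vendor's format.

```python
import json

# Hypothetical tools: ordinary functions the surrounding system is willing to run.
def add_numbers(a, b):
    return a + b

def word_count(text):
    return len(text.split())

# Hypothetical registry mapping tool names to callables.
TOOLS = {"add_numbers": add_numbers, "word_count": word_count}

def execute_tool_call(raw):
    """Parse a model-emitted tool call and return the result as text."""
    call = json.loads(raw)
    name = call["tool"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        # Errors also go back to the model as text it can condition on.
        return f"error: unknown tool {name!r}"
    result = TOOLS[name](**args)
    return json.dumps({"tool": name, "result": result})

# Pretend the model produced this string instead of prose.
model_output = '{"tool": "word_count", "arguments": {"text": "a next-token predictor"}}'
observation = execute_tool_call(model_output)
print(observation)
```

The point of the sketch is the division of labour: the model only writes text, and everything that actually happens is done by plain code that decides which names it will honour.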

That feedback turns a single prediction into a loop. The model observes the current state (the task, plus whatever results have come back so far), decides what to do next, acts by calling a tool, then observes the result — and repeats, until the goal is met. The language model has become the controller of a perceive–act cycle. No single forward pass has to be right; the loop can check its own work, retry, and adjust.
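The observe, decide, act cycle can be sketched in a few lines if the language model is swapped for a scripted stand-in. Everything here is an assumption made for illustration: `scripted_model` is a hard-coded function rather than a real model, `search_notes` is a hypothetical tool backed by a one-entry dictionary, and the `max_steps` limit is an arbitrary choice.

```python
# Hypothetical tool: a lookup over a tiny in-memory "knowledge base".
def search_notes(query):
    notes = {"capital of france": "Paris"}
    return notes.get(query.lower(), "no result")

TOOLS = {"search_notes": search_notes}

def scripted_model(history):
    """Stand-in for a language model: picks the next action from the history."""
    last = history[-1]
    if last.startswith("task:"):
        return {"action": "call", "tool": "search_notes", "input": "capital of France"}
    if last.startswith("observation: "):
        return {"action": "finish", "answer": last[len("observation: "):]}
    return {"action": "finish", "answer": "unknown"}

def run_agent(task, max_steps=5):
    history = [f"task: {task}"]                               # initial observation
    for _ in range(max_steps):
        decision = scripted_model(history)                    # decide
        if decision["action"] == "finish":
            return decision["answer"]
        result = TOOLS[decision["tool"]](decision["input"])   # act
        history.append(f"observation: {result}")              # observe the result
    return "stopped: step budget exhausted"

print(run_agent("What is the capital of France?"))
```

A real system would replace `scripted_model` with a call to a language model, but the control flow around it would look much the same: the history grows by one observation per step, and the model's only job is to choose the next action.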

This is the shift from “what can a model say?” to “what can a model do?” And notice what the loop quietly absorbs: a retrieval call can be the same TF-IDF ranking from earlier in this repo; a code-execution step is plain classical computing; an embedding lookup is the word-vector cosine similarity. Every technique in this museum can reappear here as a tool, now invoked under the direction of a language model. Fittingly, this very repository was explored and extended by exactly such an agent, which read files, searched the corpus, and wrote the page you are looking at.
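As a small illustration of an older technique reappearing as a tool, the sketch below wraps a TF-IDF ranking with cosine similarity behind a single `retrieve` function that an agent could call. It is a fresh, minimal version written for this page, not the implementation from the earlier stop. The three-document corpus, the smoothed IDF formula, and all function names are assumptions.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, invented for illustration.
DOCS = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]

def tokenize(text):
    return text.lower().split()

def build_idf(docs):
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    # Smoothed IDF (one common variant) so very frequent terms keep a small weight.
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf_vector(text, idf):
    counts = Counter(tokenize(text))
    # Terms never seen in the corpus are dropped.
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

IDF = build_idf(DOCS)
DOC_VECTORS = [tfidf_vector(d, IDF) for d in DOCS]

def retrieve(query, k=1):
    """The tool an agent would call: return the top-k documents for a query."""
    q = tfidf_vector(query, IDF)
    ranked = sorted(range(len(DOCS)), key=lambda i: cosine(q, DOC_VECTORS[i]), reverse=True)
    return [DOCS[i] for i in ranked[:k]]

print(retrieve("stock markets"))
```

From the model's side, none of the arithmetic is visible. It emits a request naming `retrieve` and a query string, and receives ranked text back.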

The language model sits at the centre as a controller. It runs an observe → decide → act cycle; the act step branches out to real tools — web search, code execution, a database, the filesystem — and each tool's result is fed straight back into observe, closing the loop. The cycle turns one prediction into a goal-directed sequence of actions.

Where it falls short

Errors compound across steps. A single prediction can be wrong and harmless. In a loop, a mistake early on becomes the input to every later step — a wrong assumption, a misread file, a bad search — and can quietly derail the entire task long before anyone notices.
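A back-of-the-envelope calculation shows how quickly this adds up. It rests on a deliberately crude assumption: every step succeeds independently with the same probability, and one failed step sinks the task. Real loops can sometimes notice and repair a mistake, so treat the numbers as an illustration of the trend rather than a measurement. The function name and the chosen probabilities are hypothetical.

```python
def task_success_probability(per_step_success, steps):
    """Chance that every one of `steps` independent steps succeeds."""
    return per_step_success ** steps

# Print a small grid of per-step reliabilities against task lengths.
for p in (0.99, 0.95, 0.90):
    for n in (5, 20, 50):
        print(f"per-step={p:.2f}  steps={n:>2}  whole task={task_success_probability(p, n):.3f}")
```

Under these assumptions, a step that is right 95% of the time, repeated 20 times, completes the whole task only about 36% of the time.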

The safety surface widens dramatically. A model that can only talk can, at worst, say something wrong. A model that can act — run code, edit files, hit external services — can do something wrong, or be steered into doing harm. Giving a pattern-matcher real-world levers is a genuinely different risk than giving it a microphone.

Loops can fail as loops. Agents get stuck repeating themselves, thrash between dead ends, run up unbounded cost, or pursue a subtly wrong interpretation of the goal with great persistence. Autonomy multiplies both the upside and the ways things go sideways.
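Some of these failure modes are at least detectable from outside the model. The sketch below is one hypothetical guard that a surrounding system might wrap around the loop: it counts steps, accumulates cost, and flags an action that keeps recurring. The class name, the limits, and the per-step cost are all assumptions, and a guard like this does nothing about the subtler failure of pursuing the wrong goal.

```python
from collections import Counter
from typing import Optional

class LoopGuard:
    """Tracks steps, spend, and repeated actions. All limits are illustrative."""

    def __init__(self, max_steps=10, max_cost=1.0, max_repeats=2):
        self.max_steps = max_steps
        self.max_cost = max_cost
        self.max_repeats = max_repeats
        self.steps = 0
        self.cost = 0.0
        self.seen = Counter()

    def check(self, action: str, step_cost: float) -> Optional[str]:
        """Record one proposed action; return a stop reason, or None to continue."""
        self.steps += 1
        self.cost += step_cost
        self.seen[action] += 1
        if self.steps > self.max_steps:
            return "step budget exhausted"
        if self.cost > self.max_cost:
            return "cost budget exhausted"
        if self.seen[action] > self.max_repeats:
            return f"repeated action: {action}"
        return None

guard = LoopGuard(max_steps=10, max_cost=1.0, max_repeats=2)
# Hypothetical sequence of actions proposed by an agent that is going in circles.
proposed = ["search('x')", "read('a.txt')", "search('x')", "search('x')"]
stop_reason = None
for action in proposed:
    stop_reason = guard.check(action, step_cost=0.05)
    if stop_reason:
        break
print(stop_reason)
```

Here the third identical search trips the repeat limit and the loop stops with a reason that a person, or the model itself, can inspect.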

The substrate has not changed. Underneath the tools and the loop is the same next-token pattern-matching this whole journey is built on. Wrapping it in actions adds capability, not guaranteed correctness, reliability, or understanding.