2020 → today · Retrieval-Augmented Generation

Retrieval-Augmented Generation

Don't make it memorise the world. Let it look things up.

How it works

This is where the whole journey closes a loop. RAG bolts a retriever — the very TF-IDF machinery from earlier in this series — onto a generator. At question time the retriever fetches the most relevant documents, and the generator answers using only those documents.

It runs in three steps. Retrieve: score every document against the query with TF-IDF and keep the top few. Augment: those documents become the context, the only text the generator may draw on. Generate: produce an answer grounded in that context. Below, a bigram Markov model stands in for a large language model; the stand-in is deliberate, because the lesson is grounding, not fluency.
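The three steps can be sketched in plain Python. This is a minimal sketch, not the code behind the run shown on this page: the six-line toy corpus (first lines of six sonnets), the smoothed IDF formula, cosine scoring, the seed value, and the names tokenize, tfidf_vectors, cosine, retrieve, train_bigrams, and generate are all assumptions made for illustration.

```python
import math
import random
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    # Smoothed IDF (an assumption; other variants are equally valid).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=3):
    """Step 1: score every document against the query, keep the top k indices."""
    vectors, idf = tfidf_vectors(docs)
    q_tf = Counter(tokenize(query))
    q_vec = {t: c * idf[t] for t, c in q_tf.items() if t in idf}
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
    return ranked[:k]


def train_bigrams(texts):
    """Map each word to the list of words that follow it in the given texts."""
    model = defaultdict(list)
    for text in texts:
        toks = tokenize(text)
        for current, following in zip(toks, toks[1:]):
            model[current].append(following)
    return model


def generate(model, length=12, seed=0):
    """Step 3: walk the bigram table; restart at a random word on a dead end."""
    rng = random.Random(seed)
    starts = sorted(model)
    word = rng.choice(starts)
    out = [word]
    for _ in range(length - 1):
        followers = model.get(word)
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)


# Toy corpus (an assumption): first lines of six sonnets, not all 154.
corpus = [
    "Against that time, if ever that time come,",
    "Devouring Time, blunt thou the lion's paws,",
    "When in the chronicle of wasted time",
    "Shall I compare thee to a summer's day?",
    "From fairest creatures we desire increase,",
    "Farewell! thou art too dear for my possessing,",
]

query = "the passage of time"
top = retrieve(query, corpus, k=3)
context = [corpus[i] for i in top]                    # Step 2: augment
grounded = generate(train_bigrams(context), seed=7)   # sees only the context
ungrounded = generate(train_bigrams(corpus), seed=7)  # sees the whole corpus

print("retrieved:", top)
print("grounded:  ", grounded)
print("ungrounded:", ungrounded)
```

Because the grounded model's table is built only from the retrieved lines, every word it emits must come from them. That property, rather than fluency, is what the bigram stand-in is meant to demonstrate.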

This is how a model answers questions about private, proprietary, or up-to-the-minute data it was never trained on — and how modern systems curb fabrication by anchoring output to retrieved evidence. Every technique in this repo that "became a component" is visible here: retrieval feeding generation.

Query
"the passage of time"
↓  retrieve · TF-IDF over 154 sonnets
Step 1 — Top 3 retrieved (the context)
sonnet #49 · score 0.0583
"Against that time, if ever that time come,"
sonnet #19 · score 0.0433
"Devouring Time, blunt thou the lion's paws,"
sonnet #106 · score 0.0422
"When in the chronicle of wasted time"
↓  augment + generate
Step 3 — Generate (same seed, same model)
Grounded · only the 3 retrieved sonnets
the thing it was shall reasons find of wasted time despite thy love shall reasons find of wasted time when in thy wrong my defects when as thou fleets and
Ungrounded · trained on all 154 sonnets
the sea the rearward of love control o loves use is crownd but doth cover every one hath in doubt till now hes king are for they or wit or

A real run. The retriever finds the time-themed sonnets (sonnet 19 is literally "Devouring Time"). Fed only those, the grounded generator speaks in the language of time; the ungrounded model, trained on everything, wanders off to seas and kings. Same generator, same seed — only the retrieved context differs.

Try it

Retrieve, then generate

The grounded text is built only from the retrieved sonnets, so it stays on-topic; the ungrounded baseline draws on the whole corpus and wanders. Change the query and watch what gets retrieved — and how the grounded output follows.

Where it falls short

Garbage in, garbage out. If retrieval surfaces the wrong documents, the generator confidently grounds its answer in irrelevant text. RAG is only ever as good as its retriever.
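A small sketch of why this happens: a plain top-k cut always returns k documents, even when nothing matched. The min_score floor shown here is an assumption added for illustration, not something this page's demo is known to use, and the function names are hypothetical.

```python
def top_k(scores, k=3):
    """Indices of the k highest scores, regardless of how low they are."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def top_k_with_floor(scores, k=3, min_score=0.0):
    """Same cut, but drop anything that does not score above min_score."""
    return [i for i in top_k(scores, k) if scores[i] > min_score]


# A query that shares no terms with any document scores zero everywhere.
scores = [0.0, 0.0, 0.0, 0.0]
naive = top_k(scores)               # still hands the generator 3 documents
guarded = top_k_with_floor(scores)  # hands it nothing

print(naive, guarded)
```

A floor only catches the case where nothing matched. It does nothing for documents that score well for the wrong reasons, which is the harder half of the problem.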

Lexical retrieval misses meaning. TF-IDF scores documents by exact token overlap, so a query for "automobile" will not retrieve a passage about "cars". Production systems swap in dense embedding retrieval to fix this, at the cost of needing a trained embedding model to produce the vectors.
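The mismatch is easy to reproduce. The sketch below uses a simplified lexical score (the sum of tf × idf over query terms) and three made-up documents; the function names and the smoothed IDF are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_score(query, doc, docs):
    """Sum tf * idf over the query's terms; a term absent from doc adds 0."""
    n = len(docs)
    doc_tf = Counter(tokens(doc))
    score = 0.0
    for term in set(tokens(query)):
        df = sum(term in tokens(d) for d in docs)
        idf = math.log((1 + n) / (1 + df)) + 1.0
        score += doc_tf[term] * idf  # Counter returns 0 for a missing term
    return score


# Made-up documents (an assumption), one per line.
docs = [
    "Cars need regular oil changes.",
    "The automobile reshaped the modern city.",
    "Sonnets have fourteen lines.",
]

scores = [tfidf_score("automobile", d, docs) for d in docs]
print(scores)
```

The passage about cars scores exactly zero: it shares no token with the query, and token overlap is all this kind of scoring can see.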

The generator can still drift. Conditioning on context is a nudge, not a guarantee; a real LLM may blend retrieved facts with its own priors and fabricate anyway. (Our bigram stand-in has the opposite limit: it can only recombine context words, so it can never be fluent.)

Chunking and ranking are fiddly. How you split documents, how many chunks you fetch, and the order in which you present them all change the final answer.
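To see how much a splitting choice matters, here is a minimal sketch of one common approach: fixed-size word windows with optional overlap. The function name chunk_words and its parameters are hypothetical, and real systems often split on sentences or tokens instead.

```python
def chunk_words(text, size, overlap=0):
    """Split text into windows of `size` words, sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


text = "a b c d e f g h"
no_overlap = chunk_words(text, size=4)
with_overlap = chunk_words(text, size=4, overlap=2)

print(no_overlap)
print(with_overlap)
```

The same eight words become two retrievable units or three depending on one parameter, and a phrase that straddles a boundary such as "d e" is only kept intact in the overlapping version.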