2020 → today · Retrieval-Augmented Generation
Don't make it memorise the world. Let it look things up.
This is where the whole journey closes the loop. RAG bolts a retriever (here, the very TF-IDF machinery from earlier in this series) onto a generator. At question time the retriever fetches the most relevant documents, and the generator answers using only those documents.
It runs in three steps. Retrieve: score every document against the query with TF-IDF and keep the top few. Augment: those documents become the context, the only text the generator may draw on. Generate: produce an answer grounded in that context. Below, a bigram Markov model stands in for a large language model; the stand-in is deliberate, because the lesson is grounding, not fluency.
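The three steps can be sketched in a few dozen lines of standard-library Python. This is a minimal illustration, not the code behind the demo: the three-line corpus, the function names, and the smoothed IDF formula are all assumptions chosen for brevity.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus; the demo itself uses sonnets.
DOCS = [
    "time devours the lion and the tiger",
    "the sea carries kings and ships away",
    "summer days fade as time moves on",
]

def tokenize(text):
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    # Smoothed IDF (an assumed variant; other formulas work too).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    """Step 1: score every document against the query, keep the top k."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)),
                    reverse=True)
    return [docs[i] for score, i in scored[:k] if score > 0]

def train_bigrams(texts):
    """Step 2: the retrieved texts are the only training data (the context)."""
    model = defaultdict(list)
    for text in texts:
        toks = tokenize(text)
        for a, b in zip(toks, toks[1:]):
            model[a].append(b)
    return model

def generate(model, start, length=8, seed=0):
    """Step 3: walk the bigram table, stopping at a word with no follower."""
    rng = random.Random(seed)
    word, out = start, [start]
    for _ in range(length - 1):
        followers = model.get(word)
        if not followers:
            break
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out)

context = retrieve("time", DOCS, k=2)
model = train_bigrams(context)
answer = generate(model, start="time")
print(context)
print(answer)
```

Because the bigram table is built from `context` alone, every word in `answer` comes from a retrieved document. That is the grounding property in miniature.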
This is how a model answers questions about private, proprietary, or up-to-the-minute data it was never trained on — and how modern systems curb fabrication by anchoring output to retrieved evidence. Every technique in this repo that "became a component" is visible here: retrieval feeding generation.
A real run. The retriever finds the time-themed sonnets (sonnet 19 is literally "Devouring Time"). Fed only those, the grounded generator speaks in the language of time; the ungrounded model, trained on everything, wanders off to seas and kings. Same generator, same seed — only the retrieved context differs.
Because the grounded text is built only from the retrieved sonnets, it stays on-topic, while the ungrounded baseline draws on the whole corpus and wanders. Change the query and watch what gets retrieved, and how the grounded output follows.
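The contrast can be reproduced in miniature. In the sketch below, the three-line corpus is hypothetical and the "retrieved" set is hard-coded, as if a retriever had already answered the query "time". The same generator and the same seed are used for both runs; only the training text differs.

```python
import random
from collections import defaultdict

# Hypothetical mini-corpus; the first two lines play the retrieved documents.
CORPUS = [
    "time will devour the summer and time will take the rose",
    "the clock tells the time and the rose will fade",
    "the king sails the sea and the sea will take the king",
]
RETRIEVED = CORPUS[:2]  # assumed retriever output for the query "time"

def bigram_model(texts):
    table = defaultdict(list)
    for text in texts:
        words = text.split()
        for a, b in zip(words, words[1:]):
            table[a].append(b)
    return table

def generate(table, start, length=12, seed=7):
    rng = random.Random(seed)
    out = [start]
    # Stop early if the current word has no recorded follower.
    while len(out) < length and table.get(out[-1]):
        out.append(rng.choice(table[out[-1]]))
    return out

grounded = generate(bigram_model(RETRIEVED), "the")
ungrounded = generate(bigram_model(CORPUS), "the")

retrieved_vocab = {w for t in RETRIEVED for w in t.split()}
off_context = [w for w in ungrounded if w not in retrieved_vocab]

print("grounded:  ", " ".join(grounded))
print("ungrounded:", " ".join(ungrounded))
print("words outside the retrieved context:", off_context)
```

The grounded run cannot leave `retrieved_vocab` by construction. Whether the ungrounded run actually reaches "king" or "sea" depends on the seed, so `off_context` may be empty on some runs.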
Garbage in, garbage out. If retrieval surfaces the wrong documents, the generator confidently grounds its answer in irrelevant text. RAG is only ever as good as its retriever.
Lexical retrieval misses meaning. TF-IDF scores exact token overlap, so a query for "automobile" will not retrieve a passage about "cars". Production systems swap in dense embedding retrieval to fix this, at the cost of needing a trained embedding model to produce the vectors.
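The miss is easy to demonstrate. The sketch below uses a hypothetical two-document corpus and an assumed smoothed IDF formula. The score is a sum over terms the query and document share, so with no shared token it is exactly zero, however close the meanings are.

```python
import math
from collections import Counter

# Hypothetical documents for illustration only.
DOCS = [
    "cars need regular servicing",
    "bread rises when yeast is warm",
]

def tfidf_score(query, doc, docs):
    """Sum tf * idf over query terms; terms absent from doc contribute 0."""
    n = len(docs)
    doc_counts = Counter(doc.lower().split())
    total = 0.0
    for term in set(query.lower().split()):
        df = sum(1 for d in docs if term in d.lower().split())
        idf = math.log((1 + n) / (1 + df)) + 1  # assumed smoothing variant
        total += doc_counts[term] * idf
    return total

print(tfidf_score("automobile", DOCS[0], DOCS))  # no shared token, so 0.0
print(tfidf_score("cars", DOCS[0], DOCS))        # shared token, so positive
```

A dense retriever would compare learned vectors instead of tokens, which is what lets "automobile" land near "cars".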
The generator can still drift. Conditioning on context is a nudge, not a guarantee; a real LLM may blend retrieved facts with its own priors and fabricate anyway. (Our bigram stand-in has the opposite limit: it can only recombine context words, so it never strays from the context but can never be fluent either.)
Chunking and ranking are fiddly. How you split documents, how many chunks you fetch, and in what order you present them all change the final answer.