Retrieval-Augmented Generation (RAG) is how you make an AI answer questions about your business — your docs, your tickets, your policies — instead of whatever it memorised during training. The idea is simple: before the model answers, you fetch the most relevant pieces of your own data and hand them to it as context.
The hard part isn't the model. After shipping RAG for several clients, we found the same thing every time: the answer is only as good as the chunk you retrieved. If the right passage never makes it into the prompt, no amount of clever wording saves you.
The formula we settled on
We now treat RAG as a pipeline of four measurable stages, and we tune them in order:
The one metric that matters first
Before we judge a single answer, we measure recall@k — did the correct chunk land in the top k results we sent to the model?
recall@k = (queries where the right chunk is in top k) / (total queries)
If recall@k is low, debating answer quality is pointless, because the model is guessing from the wrong context. We push recall@k above ~0.9 first, then tune the generation prompt. Measuring retrieval quality separately from answer quality is the single most useful change we made.
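As a rough illustration of the recall@k formula above, here is a minimal Python sketch. It assumes each question in the golden dataset has exactly one correct chunk ID, and that the retriever returns a ranked list of chunk IDs per question. The names recall_at_k, gold and retrieved, and the sample IDs, are hypothetical and not part of any specific library.

```python
from typing import Dict, List


def recall_at_k(retrieved: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose correct chunk ID appears in the top-k results.

    Assumption: one correct chunk per query. Queries missing from
    `retrieved` count as misses.
    """
    if not gold:
        raise ValueError("golden dataset is empty")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(
        1
        for query_id, chunk_id in gold.items()
        if chunk_id in retrieved.get(query_id, [])[:k]
    )
    return hits / len(gold)


# Hypothetical golden dataset: query ID -> ID of the chunk that answers it.
gold = {"q1": "c7", "q2": "c2", "q3": "c9", "q4": "c4"}

# Hypothetical retriever output: query ID -> chunk IDs, best match first.
retrieved = {
    "q1": ["c7", "c1", "c3"],
    "q2": ["c5", "c2", "c8"],
    "q3": ["c1", "c3", "c9"],
    "q4": ["c6", "c5", "c1"],
}

recall_1 = recall_at_k(retrieved, gold, k=1)  # only q1 hits
recall_3 = recall_at_k(retrieved, gold, k=3)  # q1, q2 and q3 hit; q4 misses
```

In this toy data, q4 never retrieves its chunk at any k, which is exactly the kind of query worth inspecting before touching the generation prompt.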
Hybrid score, concretely
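For ranking we blend two signals, semantic similarity (cosine) and keyword relevance (BM25), with a weighted sum. Raw BM25 scores are unbounded, so each signal is normalised to 0–1 before blending: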
score = α · cosine(query, chunk) + (1 − α) · bm25(query, chunk)
We start at α = 0.5 and tune per dataset. Document-heavy, jargon-rich corpora (legal, medical) lean lower, because exact keyword matches matter more there; conversational corpora lean higher, because semantic similarity carries more of the signal.
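The weighted sum above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production implementation: it assumes the cosine and BM25 scores for one query's candidate chunks have already been computed, and it uses min-max scaling per query as the normalisation step, which is one common choice among several. The names min_max and hybrid_rank, and the sample scores, are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to 0-1 across one query's candidates (assumed normalisation)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie, so this signal carries no ranking information.
        return {chunk_id: 0.0 for chunk_id in scores}
    return {chunk_id: (s - lo) / (hi - lo) for chunk_id, s in scores.items()}


def hybrid_rank(
    cosine: Dict[str, float], bm25: Dict[str, float], alpha: float = 0.5
) -> List[Tuple[str, float]]:
    """Rank chunks by alpha * cosine + (1 - alpha) * bm25 on normalised scores."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    cos_n, bm_n = min_max(cosine), min_max(bm25)
    # A chunk found by only one retriever gets 0.0 for the missing signal.
    chunk_ids = set(cos_n) | set(bm_n)
    blended = {
        cid: alpha * cos_n.get(cid, 0.0) + (1 - alpha) * bm_n.get(cid, 0.0)
        for cid in chunk_ids
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for three candidate chunks on a single query.
cosine_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
bm25_scores = {"a": 2.0, "b": 12.0, "c": 7.0}

balanced = [cid for cid, _ in hybrid_rank(cosine_scores, bm25_scores, alpha=0.5)]
semantic_heavy = [cid for cid, _ in hybrid_rank(cosine_scores, bm25_scores, alpha=0.9)]
keyword_heavy = [cid for cid, _ in hybrid_rank(cosine_scores, bm25_scores, alpha=0.1)]
```

With these sample scores, chunk "b" ranks first at α = 0.5 and α = 0.1 because of its strong keyword match, while chunk "a" takes the top spot at α = 0.9. Sweeping α and re-measuring recall@k on the golden dataset is one way to pick the value per corpus.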
What this means for you
You don't need a bigger model. You need a measured retrieval pipeline. We build it in your stack, wire up a small golden dataset of real questions, and tune against recall@k until the answers are grounded and citable — not confident guesses.