RAG: getting the right context, every time

Retrieval-Augmented Generation (RAG) is how you make an AI answer questions about your business — your docs, your tickets, your policies — instead of whatever it memorised during training. The idea is simple: before the model answers, you fetch the most relevant pieces of your own data and hand them to it as context.

The hard part isn't the model. After shipping RAG for several clients, we found the same thing every time: the answer is only as good as the chunk you retrieved. If the right passage never makes it into the prompt, no amount of clever wording saves you.

The formula we settled on

We now treat RAG as a pipeline of four measurable stages, and we tune them in order:

  • Chunking: split documents on semantic boundaries (headings, paragraphs), not fixed token counts. A chunk should be one coherent idea.
  • Hybrid retrieval: combine dense vector search (meaning) with keyword search (exact terms like product codes). Each catches what the other misses.
  • Reranking: cast a wide net of ~20 candidates, then use a cross-encoder reranker to pick the best 4–5. This one change removed more errors than any model upgrade.
  • Grounded generation: instruct the model to answer only from the supplied context and cite which chunk it used.
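
To make the chunking stage concrete, here is a minimal sketch of boundary-aware splitting. It assumes Markdown-style `#` headings and blank-line-separated paragraphs; the function name `chunk_document` and the `max_chars` budget are illustrative choices, not part of any particular library or of our production code.

```python
import re

# Assumption: headings look like Markdown ("# Title", "## Section").
HEADING = re.compile(r"^#{1,6}\s")


def chunk_document(text: str, max_chars: int = 1200) -> list[str]:
    """Group paragraphs into chunks without ever cutting inside a paragraph.

    A new chunk starts at every heading, or when adding the next
    paragraph would push the current chunk past max_chars.
    """
    # Blank lines are treated as paragraph boundaries.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]

    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        starts_section = bool(HEADING.match(block))
        too_big = size + len(block) > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph longer than `max_chars` is kept whole rather than cut mid-idea; whether that trade-off suits a given corpus is something to check against your own documents.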
The one metric that matters first

Before we judge a single answer, we measure recall@k: did the correct chunk land in the top k results we sent to the model?

    recall@k = (queries where the right chunk is in top k) / (total queries)

If recall@k is low, debating answer quality is pointless because the model is guessing. We push recall@k above ~0.9 first, then tune the generation prompt. Separating retrieval quality from answer quality is the single biggest unlock.
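
The recall@k formula above can be computed in a few lines. This is a sketch under two assumptions: each golden question has exactly one correct chunk, and chunks are identified by string ids. The names `recall_at_k`, `retrieved`, and `gold` are hypothetical.

```python
def recall_at_k(
    retrieved: dict[str, list[str]],
    gold: dict[str, str],
    k: int,
) -> float:
    """Fraction of golden queries whose correct chunk id is in the top k.

    retrieved: query -> ranked chunk ids, best first
    gold:      query -> the single correct chunk id (assumption)
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not gold:
        raise ValueError("gold set is empty")

    hits = sum(
        1
        for query, chunk_id in gold.items()
        # A query with no retrieval results counts as a miss.
        if chunk_id in retrieved.get(query, [])[:k]
    )
    return hits / len(gold)
```

Running it at several values of k on the same golden set shows how much headroom a reranker has: if the right chunk is usually in the top 20 but rarely in the top 5, ordering is the problem rather than retrieval coverage.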

Hybrid score, concretely

For ranking we blend the two signals with a weighted sum. Both are normalised to 0–1 before blending, because raw BM25 scores are unbounded and would otherwise swamp the cosine term:

    score = α · cosine(query, chunk) + (1 − α) · bm25(query, chunk)

We start at α = 0.5 and tune per dataset. Document-heavy, jargon-rich corpora (legal, medical) lean lower, because keyword matching matters more there; conversational corpora lean higher.
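
As an illustration of the weighted blend, the sketch below takes precomputed dense and keyword scores for one query's candidate chunks and combines them. Min-max scaling per query is an assumption made here for the normalisation step; other schemes are possible, and the function names are hypothetical.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to the 0-1 range within one query's candidate list."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates tie, so this signal carries no ranking information.
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(
    dense: list[float],
    keyword: list[float],
    alpha: float = 0.5,
) -> list[float]:
    """Blend per-chunk scores: alpha * dense + (1 - alpha) * keyword.

    dense and keyword must be aligned, one score per candidate chunk.
    """
    if len(dense) != len(keyword):
        raise ValueError("score lists must be the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")

    d = min_max(dense)
    kw = min_max(keyword)
    return [alpha * a + (1 - alpha) * b for a, b in zip(d, kw)]
```

Setting `alpha=1.0` reduces this to pure dense ranking and `alpha=0.0` to pure keyword ranking, which gives two useful baselines when tuning against recall@k.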

What this means for you

You don't need a bigger model. You need a measured retrieval pipeline. We build it in your stack, wire up a small golden dataset of real questions, and tune against recall@k until the answers are grounded and citable rather than confident guesses.
