
Evaluation: how we know an AI feature actually works


Evaluation is the unglamorous discipline that separates an AI demo from an AI product. It's how you answer a deceptively hard question: did that change make the system better or worse? Without a way to measure, every prompt tweak is a guess and every release is a gamble.

We learned this the hard way — "improving" a prompt that three people agreed read better, then watching support tickets climb because it quietly broke answers the old version handled. The fix wasn't a smarter model. It was measurement.

Start with a golden dataset

The foundation is a set of real cases — questions pulled from actual usage — each paired with what a good answer should contain. You don't need thousands. Eighty well-chosen cases caught most of our regressions. This dataset becomes the thing you protect.
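As a rough illustration of what one entry in such a dataset might look like, here is a minimal Python sketch. The `GoldenCase` type, its field names, the `validate` helper, and the two sample cases are all assumptions made for this example, not a description of our actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One real question paired with what a good answer should contain."""
    case_id: str
    question: str
    must_contain: tuple[str, ...]  # key points a good answer should cover (hypothetical field)


def validate(cases: list[GoldenCase]) -> list[str]:
    """Return a list of problems; an empty list means the dataset is usable."""
    problems = []
    seen = set()
    for case in cases:
        if case.case_id in seen:
            problems.append(f"duplicate id: {case.case_id}")
        seen.add(case.case_id)
        if not case.question.strip():
            problems.append(f"{case.case_id}: empty question")
        if not case.must_contain:
            problems.append(f"{case.case_id}: no expected content")
    return problems


# Illustrative cases only; real ones would be pulled from actual usage.
GOLDEN = [
    GoldenCase("refund-01", "How do I get a refund?", ("refund window", "payment method")),
    GoldenCase("login-02", "I can't log in after a password reset.", ("clear session",)),
]
```

Keeping the cases in a typed, validated structure makes it harder for a malformed entry to slip in and silently weaken the suite.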

LLM-as-judge

Grading hundreds of free-text answers by hand doesn't scale, so we use a model to score a model. A separate "judge" rates each answer against the expected one on a simple rubric:

judge(answer, reference) → { correct: 0–1, grounded: 0–1, complete: 0–1 }
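The signature above could be fleshed out along these lines. This is a sketch under assumptions: `call_model` stands in for whatever client sends a prompt to the judge model and returns its text reply, and the instruction wording and JSON reply format are hypothetical choices, not a fixed recipe.

```python
import json
from typing import Callable

RUBRIC = ("correct", "grounded", "complete")

# Hypothetical instruction text; the real wording would be tuned during calibration.
JUDGE_INSTRUCTIONS = (
    "Score the ANSWER against the REFERENCE on three criteria, each from 0 to 1: "
    "correct, grounded, complete. Reply with JSON only, for example "
    '{"correct": 1, "grounded": 0.5, "complete": 1}.'
)


def judge(answer: str, reference: str, call_model: Callable[[str], str]) -> dict[str, float]:
    """Ask the judge model for rubric scores and clamp each one into [0, 1]."""
    prompt = f"{JUDGE_INSTRUCTIONS}\n\nREFERENCE:\n{reference}\n\nANSWER:\n{answer}"
    data = json.loads(call_model(prompt))  # raises if the reply is not valid JSON
    scores = {}
    for key in RUBRIC:
        value = float(data[key])  # raises KeyError if a criterion is missing
        scores[key] = min(1.0, max(0.0, value))
    return scores
```

Passing the model call in as a function keeps the judge testable with a stub, so the scoring logic can be checked without a live model.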

Calibrated against a few human-graded examples, an LLM judge tracks human judgement closely enough to run on every change, cheaply and without waiting for a human grader.
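One simple way to check calibration is to compare the judge's scores with human grades on the same answers. The sketch below uses mean absolute error; both the metric and the `max_error` threshold of 0.15 are assumptions for illustration, and other agreement measures would work as well.

```python
from statistics import mean


def mean_abs_error(judge_scores: list[float], human_scores: list[float]) -> float:
    """Average absolute gap between judge and human scores on the same answers."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    return mean(abs(j - h) for j, h in zip(judge_scores, human_scores))


def is_calibrated(judge_scores: list[float], human_scores: list[float],
                  max_error: float = 0.15) -> bool:
    """True if the judge stays within max_error of the human grades on average.

    The 0.15 default is an assumed threshold, not a recommended value.
    """
    return mean_abs_error(judge_scores, human_scores) <= max_error
```

If the check fails, the usual move is to revise the judge's rubric or instructions and re-run it against the same human-graded set.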

Make it a gate, not a report

The discipline that mattered: the eval suite runs on every prompt or model change, and a drop in score blocks the release. We track an aggregate the same way you'd track test coverage:

score = mean(correct) · w₁ + mean(grounded) · w₂ + mean(complete) · w₃
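In code, the aggregate might be computed as follows. The weight values are placeholders chosen for this sketch; the article does not specify w₁, w₂, or w₃.

```python
from statistics import mean

# Assumed weights for illustration only (w1, w2, w3 in the formula above).
WEIGHTS = {"correct": 0.5, "grounded": 0.3, "complete": 0.2}


def aggregate(results: list[dict[str, float]], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the per-criterion means across all judged cases."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(w * mean(r[key] for r in results) for key, w in weights.items())
```

With weights that sum to 1, the aggregate stays in the same 0 to 1 range as the individual rubric scores, which makes baselines easier to read.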

If the score regresses against the last known-good baseline, the change doesn't ship until we understand why.
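A gate of this kind can be as small as a function that returns a non-zero exit code for a CI job to act on. The sketch below assumes the new score and the baseline are already computed; the `tolerance` parameter is an added assumption to absorb judge noise and defaults to zero, so any drop blocks.

```python
def passes_gate(score: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True if the new score has not dropped below the baseline (minus tolerance)."""
    return score >= baseline - tolerance


def run_gate(score: float, baseline: float, tolerance: float = 0.0) -> int:
    """Return a process exit code: 0 lets the change ship, 1 blocks it."""
    if passes_gate(score, baseline, tolerance):
        print(f"eval gate passed: {score:.3f} vs baseline {baseline:.3f}")
        return 0
    print(f"eval gate FAILED: {score:.3f} is below baseline {baseline:.3f}")
    return 1
```

In a pipeline, the return value would typically be passed to `sys.exit` so that a regression fails the build rather than producing a report nobody reads.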

The real lesson

The prompt is the easy, visible part. The eval suite (the golden dataset, the judge, the regression gate) is what you're actually building, and it quietly becomes the most valuable asset you own.

What this means for you

When we build an AI feature for you, we build its eval suite alongside it. You get a system whose quality is measured, defended on every change, and improvable on purpose — not by feel.
